Abstract

This report describes the MUDOS-NG summarization system, which applies a set of 
language-independent and generic methods for generating extractive summaries. The pro- 
posed methods are mostly combinations of simple operators on a generic character n-gram 
graph representation of texts. This work defines the set of used operators upon n-gram 
graphs and proposes using these operators within the multi-document summarization pro- 
cess in such subtasks as document analysis, salient sentence selection, query expansion and 
redundancy control. Furthermore, a novel chunking methodology is used, together with a 
novel way to assign concepts to sentences for query expansion. The experimental results of 
the summarization system, performed upon widely used corpora from the Document Un- 
derstanding and the Text Analysis Conferences, are promising and provide evidence for the 
potential of the generic methods introduced. This work aims to designate core methods 
exploiting the n-gram graph representation, providing the basis for more advanced summa- 
rization systems. 

1 Introduction 

Since the late 50s and Luhn [Luhn, 1958] the information community has expressed its interest in
summarizing texts. The domains of application of such methodologies range from news
summarization [Wu and Liu, 2003, Barzilay and McKeown, 2005, Radev et al., 2005] to scientific
article summarization [Teufel and Moens, 2002] and meeting summarization [Niekrasz et al., 2005,
Erol et al., 2003].

Summarization has been defined as a reductive transformation of a given set of texts, usually 
described as a three-step process: selection of salient portions of text, aggregation of the informa- 
tion for various selected portions and abstraction of this information, and finally, presentation of 
the final summary text [Mani and Bloedorn, 1999, Jones, 1999]. The summarization community 
aims to address major problems that arise during the summarization process. 

• How can one detect and select salient information to be included in the summary? Does 
the use of a query drive the information-selection task, and how? 

• How can one assure that the final summary does not contain redundant or repeated in- 
formation, especially when multiple documents are used as input to the summarization 
process? 



• Can one develop methods that will function independently from the language of documents
and to what degree can this be achieved?

To date, many summarization systems have been developed and presented, especially
within such endeavors as the Document Understanding Conferences (DUC) and Text Analysis
Conferences (TAC)^1. The summarization community appears to have moved from single-text
to multi-text input and has also reached such domains as opinion summarization and "trend"
summarization, as in the case of NTCIR^2. However, different evaluations performed in recent
years have shown that the multi-document summarization task is highly complex and demanding,
and that automatic summarizers have a long way to go to perform as well as humans [Dang, 2005,
Dang, 2006, Dang and Owczarzak, 2008]. It was recently shown [Genest et al., 2009] that the
extractive approach has an upper limit of performance, which is lower than that of the
abstractive approach of humans. However, extractive summarization appears to have more room
for improvement in order to reach that upper limit of performance, which is set by humans
performing extractive summarization through simple sentence selection and reordering.

Even the current methods for the evaluation of summaries are under criticism [Jones, 2007, 
Conroy and Dang, 2008, Giannakopoulos et al., 2008a]. Indeed, it has been shown that evaluat- 
ing different aspects of a summary is far from being a trivial task. Nevertheless, when it comes to 
ranking summarization systems, i.e., the average performance of a summarization system over 
a set of summaries, the existing evaluation methodologies offer quite good correlation to human 
judgment [Dang, 2006, Dang and Owczarzak, 2008, Giannakopoulos et al., 2008b]. 

Within this work we tackle the problems of salience detection and redundancy control in 
extractive multi-document summarization using a unified, language independent and generic 
framework based on n-gram graphs. The contributed methods offer a basic, language-neutral, 
easily adaptable set of tools. The basic idea behind this framework is that neighborhood and 
relative position of characters, words and sentences in documents offer more information than 
that of the 'bag-of-words' approach. Furthermore, the methods go beyond the word level of 
analysis into the sub-word (character n-gram) level, which offers further flexibility and indepen- 
dence from language and acts as a uniform representation for sentences, senses, documents and 
document sets. Through this study we provide a proof-of-concept methodology that can be used 
in more advanced summarization systems. We also experiment using this methodology as a basic 
summarization system, named MUDOS-NG. 

In the following sections we briefly review the current literature (section 2), we present the
n-gram graph framework (section 3) and the proposed approach (section 4), and perform a set of
experiments to evaluate its performance and show its potential (section 5). The article concludes
with discussion and proposals for further development (section 6).

2 The Literature 

Presently, the literature of automatic multi-document summarization has grown to a level that 
is very hard to overview in detail [Jones, 2007]. However, one can identify specific commonalities 
in the way summarizers extract and reproduce information into output summaries.
Summarizing systems are usually classified as being either extractive or abstractive in their
approach [Mani and Bloedorn, 1999]. Extractive approaches focus on the extraction and use of
text chunks, i.e., text snippets, from the source texts in the final summary. Abstractive 
approaches, on the other hand, aim to first represent information using an intermediate repre- 
sentation, for example first-order logic, and then use language generation to produce the output 

^1 See http://duc.nist.gov/ and http://www.nist.gov/tac/.
^2 See http://research.nii.ac.jp/ntcir/
summary from the representation. Even though it has been shown that humans summarize in
an abstractive fashion [Banko and Vanderwende, 2004, Endres-Niggemeyer, 2000], many current
systems continue to use the extractive paradigm to perform summarization. This may be due 
to the lack of highly robust text-to-intermediate-representation methods as well as natural lan- 
guage generation methods. On the other hand, the description of a summarization system as 
purely abstractive can be disputed and it appears that summarization systems can differentiate 
themselves according to their purpose [Jones, 2007]. 

In the following paragraphs we overview the approaches existing for salience detection and 
redundancy removal, as this is the focus of our proposed work. We also make a brief literature 
review for graph-based methods, to introduce the reader to related work from the domain of 
graphs. 

2.1 Salience Detection 

To determine salience of information, researchers have used positional and structural properties of
the judged sentences with respect to the source texts. These properties can be the sentence position
(e.g., number of sentence from the beginning of the text, or from the beginning of the current
paragraph) in a document, or the fact that a sentence is part of the title or of the abstract of a
document [Edmundson, 1969, Radev et al., 2004]. Also, the relation of sentences to a
user-specific query or to a specified topic [Conroy et al., 2005, Varadarajan and Hristidis, 2006,
Park et al., 2006] is a feature providing evidence of the importance of information. Cohesion
(proper name anaphora, reiteration, synonymy, and hypernymy) and coherence (based on
Rhetorical Structure Theory [Mann and Thompson, 1987]) relations between sentences were used
in [Mani et al., 1998] to define salience. Based on a graph representation where each sentence is
a vertex and vertices are connected when there is a cohesion or coherence relation between them
(e.g., common anaphora), the salience of a sentence is computed as an operation dependent on
the graph representation (e.g., spreading activation starting from important nodes).

Often, following the bag-of-words assumption, a sentence is represented as a word-feature
vector, e.g., [Torralbo et al., 2005]. In such cases, the sequence of the represented words is
ignored. The vector dimensions represent word frequency or the Term Frequency - Inverse
Document Frequency (TF-IDF) value of a given word in the source texts. In other cases, further
analysis is performed, aiming to reduce dimensionality and produce vectors in a latent topic
space [Steinberger and Jezek, 2004, Flores et al., 2008]. Vector representations can be exploited
for measuring the semantic similarity between information chunks, by using measures such as
the cosine distance or Euclidean distance between vectors.

When the feature vectors for the chunks have been created, clustering of vectors can be
performed for identifying clusters corresponding to specific topics. A cluster can then be represented
by a single vector, for example the centroid of the corresponding cluster's vectors [Radev et al., 2000].
Chunks closest to these representative vectors are considered to be the most salient. We must
point out that for the aforementioned vector-based approaches one needs to perform preprocessing
to avoid pitfalls due to stop-words and inflection of words that create feature spaces of very
high dimension. However, the utility of the preprocessing step, which usually involves stemming
and stop-word removal, is an issue of dispute [Ledeneva, 2008, Leite et al., 2007].

More recent approaches use machine learning techniques and sets of different features to 
determine whether a source text chunk (sentence) should be considered salient and included in 
the output summary. In that case the feature vector calculated for every sentence may include 
information like sentence length, sentence absolute position in the text, sentence position within 
its corresponding paragraph, number of verbs and so forth (e.g., see [Teufel and Moens, 2002]).

It has been shown that for specific tasks, like the news summarization task of DUC, simple
positional features for the determination of summary sentences can prove very promising
for summarization systems [Dang, 2005]. However, this may falsely lead to the expectation
that features like the ones corresponding to the 'first-sentence' heuristic, i.e., the features that
indicate whether a sentence appears to be similar in content and properties to the first sentences
of a set of training instances, can be used as a universal rule. Short stories are an example
where a completely different approach is taken to perform the summarization
[Kazantseva and Szpakowicz, 2010]: the summary is expected to describe the setting
without giving away the plot. In [Jatowt and Ishizuka, 2006], we find an approach where
time-aware summaries take into account the frequency of terms over time in different versions
of web pages to determine salience.

In [Varadarajan and Hristidis, 2006] the authors create a graph, where the nodes represent
text chunks and edges indicate relation between the chunks. However, that approach differs from
our work in several ways: the authors consider as optimal summary the
maximum spanning tree of the document graph that contains all the keywords; their graph
is not a character n-gram graph; there is no proposed methodology for the chunking
(other than a given parsing delimiter parameter); and not all the edges
are kept (only those above a given threshold parameter). Furthermore, in [Varadarajan and Hristidis, 2006]
the 'TF-IDF'-related Okapi function is used to assign weights to nodes, indicating self-importance
of a node.

In multi-document summarization, different iterative ranking algorithms like PageRank [Brin and Page, 1998]
and HITS [Kleinberg, 1999] over graph representations of texts have been used to determine the
salient terms over a set of source texts [Mihalcea, 2005]. Salience has also been determined by
the use of graphs, based on the fact that documents can be represented as 'small world' topology
graphs [Matsuo et al., 2001], where important terms appear highly linked to other terms. Having
found the salient terms, one can determine the containing sentences' salience and create the final
summary.

In another approach [Hendrickx and Bosma, 2008], content units (sentences) are assigned a
normalized value (0 to 1) based on a set of graphs representing different aspects of the content
unit. These aspects include: query-relevance; cosine similarity of sentences within the same
document (termed relatedness); cross-document relatedness, which is considered an aspect 
of redundancy; redundancy with respect to prior texts; and coreference based on the number 
of coreferences between different content units. All the above aspects and their corresponding 
graphs are combined into one model that assigns the final value of salience using an iterative pro- 
cess. The process spreads importance over nodes based on the 'probabilistic centrality' method 
that takes into account the direction of edges to either augment or penalize the salience of nodes, 
based on their neighbors' salience. 

The notion of Bayesian expected risk (or loss) is applied in the summarization domain
by [Kumar et al., 2009]: sentence selection is viewed as a decision process in which
the selection of each sentence is considered a decision, and the system has to select the sentences
that minimize the risk.

The CLASSY system (e.g., see [Conroy et al., 2007], [Conroy et al., 2009]) extracts fre- 
quently occurring ('signature') terms from source texts, as well as terms from the user query. 
Using these terms, the system estimates an 'oracle score' for sentences, which relates the terms 
contained within the candidate sentences to an estimated 'ideal' distribution based on term ap- 
pearance in the query, the signature terms and the topic document cluster. Some optimization 
method (mostly Integer Linear Programming) is then used to determine the best set of sentences 
for a given length of summary, given sentence weights based on their 'oracle score'. 

This article proposes a summarization method that uses language-independent and generic 
operators that apply to a generic representation of chunks based on interconnected n-grams. 
The 'n-gram graph'-based approach has much in common with [Giannakopoulos et al., 2008b],
for chunk (and, thus, sentence) similarity. This method overcomes the need for any kind of
preprocessing and offers a basic (i.e., core) method for extractive summarization. Even the
chunking process, which separates a sentence into sub-sentential strings, is based upon statistical
analysis of a given document set. The method does not use the bag-of-words approach, as the
n-gram graphs take into account the relative position of n-grams in the text. We do not use
any features like sentence position or part-of-speech information. Our method does not deduce
salience based on the centrality or connectedness of graph vertices.

The method extracts a graph expected to represent the common content of input texts, which
is in turn considered to indicate salient information. Given a user query, the approach combines
— using n-gram graph operators — the common content graph with the given query into an 
overall importance-indicative graph. Then, we calculate the salience of source text sentences 
based on the similarity of their respective n-gram graph representation to the overall graph, i.e., 
the more a sentence representation is similar to the representation of content and query, the 
more it is considered appropriate for the final summary. Our methodology is described in depth 
in section 4. 

2.2 Redundancy and Novelty 

A problem that is somewhat complementary to salience selection is that of redundancy detec- 
tion. Whereas salience, which is a desired attribute for the information chunks in the summary, 
can be detected via measuring the similarity of these chunks to a query, redundancy indicates the 
unwanted repetition of information. Research on redundancy has given birth to the Marginal Rel- 
evance measure [Carbonell and Goldstein, 1998] and the Maximal Marginal Relevance (MMR) 
selection criterion. The basic idea behind MMR is that 'good' summary sentences (or docu- 
ments) are sentences (or documents) that are relevant to a topic without repeating information 
already in the summary. The MMR measure is a generic linear combination of any two principal 
functions that can measure relevance and redundancy. 
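As an illustration of the linear combination described above, the following is a minimal sketch of greedy MMR-style selection. The function name `mmr_select`, the trade-off weight `lam` and the callable arguments `relevance` and `similarity` are assumptions made for this example; any relevance and redundancy functions can be plugged in, and this is not the selection procedure used by MUDOS-NG.

```python
def mmr_select(candidates, relevance, similarity, k, lam=0.7):
    """Greedily pick up to k items, trading relevance against redundancy.

    candidates: list of items (e.g., sentence identifiers)
    relevance:  callable item -> float (e.g., similarity to a query)
    similarity: callable (item, item) -> float
    lam:        hypothetical weight of relevance versus redundancy
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(candidate):
            # Redundancy is the highest similarity to anything already chosen.
            redundancy = max(
                (similarity(candidate, chosen) for chosen in selected),
                default=0.0,
            )
            return lam * relevance(candidate) - (1.0 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `lam` close to 1 the selection is driven by relevance alone; lowering it penalizes candidates that resemble sentences already in the summary.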

Another approach to the redundancy problem is that of the Cross-Sentence Informational 
Subsumption (CSIS) [Radev et al., 2000], where one judges whether the information offered by
a sentence is contained in another sentence already in the summary. The 'informationally sub- 
sumed' sentence can then be omitted from the summary. The main difference between the two 
approaches is the fact that CSIS is a binary decision on information subsumption, whereas the 
MMR criterion offers a graded indication of utility and non-redundancy. 

Other approaches, overviewed in [Allan et al., 2003], use statistical characteristics of the 
judged sentences with respect to sentences already included in the summary to avoid repeti- 
tion. Such methods are the New Word and Cosine Distance methods [Larkey et al., 2003] that 
use variations of the bag-of-words based vector model to detect similarity between all pairs of 
candidate and summary sentences. Other, language model-based methods create a language 
model of the summary sentences, either as a whole or independently, and compare the language 
model of the candidate sentence to the summary sentences model [Zhang et al., 2002]. The
candidate sentence model with the minimum KL-divergence from the summary sentences' language
model is supposed to be the most redundant. 

The CLASSY system [Conroy et al., 2007, Conroy et al., 2009] represents documents in a
term vector space and controls redundancy through the following process: Given a pre-existing
set of sentences $A$ corresponding to a sentence-term matrix $M_A$, and a currently judged set
of sentences $B$ corresponding to a matrix $M_B$, $B$ is judged using the term sub-space that is
orthogonal to the eigenvectors of the space defined by $A$; this means that only terms that are not
already considered important in $A$ will be taken into account as valuable content.
In this work, we have used and evaluated two different strategies for the detection of information
redundancy. These strategies use a statistical, graph-based model of sentences by exploiting
character n-grams. The first strategy, similarly to CSIS, compares all the candidate sentences 
and determines the redundant ones. The second strategy, aiming to detect intra-summary nov- 
elty and more similar to MMR, creates an iterative n-gram graph model for every snapshot of the 
summary after a new sentence is added to it. Then, each new candidate sentence is compared 
to that graph model to determine redundancy. 

2.3 Graph-based Methods and Graph Matching 

Graphs have been used to determine salient parts of text [Mihalcea, 2004, Erkan and Radev, 2004a, 
Erkan and Radev, 2004b] or query related sentences [Otterbacher et al., 2005] in close relation to
the summarization process. Lexical relationships [Mohamed and Rajasekaran, 2006] or rhetor- 
ical structure [Marcu, 2000] and even non-apparent information [Lamkhede, 2005] have been 
represented by graphs. Graphs have also been used to detect differences and similarities
between source texts [Mani and Bloedorn, 1997] and inter-document relations [Witte et al., 2006],
as well as relations of various granularity from cross-word to cross-document as described in 
Cross-Document Structure Theory [Radev, 2000]. 

In this work, the graphs are used to represent strings of any length or granularity, from 
chunk to sentence, to document set. Throughout the proposed methodologies, we use a set of 
different operators, like similarity, merging, intersection, to perform different subtasks of the 
summarization process (query expansion, content selection, redundancy control). 

Graph similarity calculation methods can be classified into two main categories, according to
the literature.

Isomorphism-based: Isomorphism is a bijective mapping between the vertex sets of two graphs
$V_1$, $V_2$, such that all mapped vertices are equivalent, and every pair of vertices from $V_1$
shares the same state of neighborhood as their corresponding vertices of $V_2$. In other
words, in two isomorphic graphs all the nodes of one graph have their unique equivalent
in the other graph, and the graphs have identical connections between equivalent nodes.
Based on the isomorphism, a common subgraph between $V_1$, $V_2$ can be defined as a
subgraph of $V_1$ having an isomorphic equivalent graph $V_3$, which is a subgraph of $V_2$. The
maximum common subgraph of $V_1$ and $V_2$ is defined as the common subgraph with the
maximum number of vertices. For more formal definitions and an excellent introduction
to error-tolerant graph matching, i.e., fuzzy graph matching, see [Bunke, 1998].

Given the definition of the maximum common subgraph, a series of distance measures have 
been defined using various methods for the calculation of the maximum common subgraph, 
or similar constructs like the Maximum Common Edge Subgraph, or Maximum Common 
Induced Graph (also see [Raymond et al., 2002]). 

Edit-distance-based: Edit distance has been used in fuzzy string matching for some time now,
using many variations (see [Navarro, 2001] for a survey on approximate string matching).
The edit distance between two strings corresponds to the minimum number of character edit
operations (namely insertion, deletion and replacement) needed to transform one string
to the other. Based on this concept, a similar distance can be used for graphs [Bunke, 1998].
Different edit operations can be given different weights, to indicate that some edit
operations indicate more important changes than others. The edit operations for graphs' nodes
are node deletion, insertion and substitution. The same three operations can be applied
on edges, giving edge deletion, insertion and substitution.
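The string case of the edit distance described above can be sketched as follows. This is the standard dynamic-programming formulation with optional per-operation weights; the function name `edit_distance` and the parameter names `ins`, `dele` and `sub` are chosen for this illustration only. The graph variant discussed in [Bunke, 1998] is considerably more involved and is not shown.

```python
def edit_distance(a, b, ins=1, dele=1, sub=1):
    """Weighted edit distance between strings a and b.

    ins, dele, sub are the (assumed non-negative) costs of inserting,
    deleting and substituting a single character.
    """
    # prev[j] holds the cost of turning the processed prefix of a into b[:j].
    prev = [j * ins for j in range(len(b) + 1)]
    for i, char_a in enumerate(a, start=1):
        curr = [i * dele]  # turning a[:i] into the empty string
        for j, char_b in enumerate(b, start=1):
            substitute = prev[j - 1] + (0 if char_a == char_b else sub)
            delete = prev[j] + dele
            insert = curr[j - 1] + ins
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]
```

Raising the weight of one operation, for example `sub`, makes the measure prefer alignments that avoid it, which is the role the weights play in the paragraph above.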
Using a transformation from text to graph, the aforementioned graph matching methods can
be used as a means to indicate text similarity. A graph method for text comparison can be 
found in [Tomita et al., 2004], where a text is represented by first determining weights for the 
text's terms using a TF-IDF calculation and then by creating graph edges based on the term co- 
occurrences. The method proposed in this article does not require term extraction/identification 
and the corresponding representation graph is constructed by exploiting the text in a direct 
manner (i.e. no language-dependent preprocessing is required), without exploiting further back- 
ground/supportive information such as a corpus for the calculation of TF-IDF or any other 
weighting factor. 

The main difference between our method and existing methods is that we enter the sub-word level,
through the use of character n-grams. We aim to perform all required tasks towards the
summarization of a set of documents using a uniform representation of sentences, senses, documents
and document sets, regardless of the underlying language. Moreover, we map seemingly different
summarization subtasks, such as content selection and query expansion, to a set of basic graph
operators that function as general-purpose NLP operators over a common representation.

In order to understand the n-gram graph representation we use, one should take into account 
the fact that adjacency between different linguistic units within specific contexts seems to be 
a very important factor of meaning. Contextual information has been widely used and several 
methodologies have been built upon its value (e.g., [Burgess et al., 1998, Yarowsky, 1995]). 



Our methodology, described in detail in section 4, can be summarized by the following main 
steps: 

• Analysis of source documents' content. 

• Query expansion. 

• Candidate content grading. 

• Redundancy removal or novelty detection. 

• Composition of the summary. 

In the following paragraphs we present the framework (i.e., the representation and the operators)
that is used throughout these steps. We emphasize that a single framework supports
different operations in the Natural Language Processing (NLP)
domain. An early presentation of the concepts and processes described herein can be found
in [Giannakopoulos et al., 2008a].

3 N-gram Graphs, Operators and Algorithms 

We now provide the definition of n-gram, given a text (viewed as a character sequence): 

Definition 3.1 If $n > 0$, $n \in \mathbb{Z}$, and $c_i$ is the $i$-th character of an $l$-length character sequence
$T^l = (c_1, c_2, \ldots, c_l)$ (our text), then

a character n-gram $S^n = (s_1, s_2, \ldots, s_n)$ is a subsequence of length $n$ of $T^l$ if and only if
there exists $i \in [1, l - n + 1]$ such that for all $j \in [1, n]$: $s_j = c_{i+j-1}$. We shall indicate
the n-gram spanning from $c_i$ to $c_k$, $k > i$, as $S_{i,k}$, while n-grams of length $n$ will be indicated as $S^n$.

The meaning of the above formal specification is that n-gram $S^n$ can be found as a substring
of length $n$ of the original text, spanning from the $i$-th to the $(i + n - 1)$-th character of the
original text. The length $n$ of an n-gram is called either the length, size or the rank of the
n-gram.
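Definition 3.1 translates directly into a sliding window over the text. The sketch below assumes the text is a Python string and uses 0-based indexing, so the n-gram starting at position $i$ of the definition corresponds to index `i - 1` here; the function name `char_ngrams` is chosen for this illustration.

```python
def char_ngrams(text, n):
    """Return all character n-grams of rank n, in order of appearance."""
    if n <= 0:
        raise ValueError("n-gram rank must be a positive integer")
    # A text of length l yields l - n + 1 n-grams (none if l < n).
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```

For instance, `char_ngrams("abcd", 3)` yields the two 3-grams "abc" and "bcd".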
The n-gram graph is a graph $G = \{V^G, E^G, L, W\}$, where $V^G$ is the set of vertices, $E^G$ is
the set of edges, $L$ is a one-to-one function assigning a label to each vertex and edge and $W$ is
a function assigning a weight to every edge. We consider that $L$ labels edges by concatenating
the labels of the corresponding vertices in the direction of the edge, if the edge is directed, or in
lexicographic order, if the edge is undirected, adding also a special separator character. Labels
in vertices are n-grams, $v^G \in V^G$, and the edges $e^G \in E^G$ (the superscript $G$ will be omitted
where easily assumed) connecting the n-grams indicate adjacency of the corresponding vertex
n-grams in a specific context within distance $D_{win}$ (also see [Giannakopoulos et al., 2008b]).

Example 3.2 Vertex $v_1$ corresponds to n-gram "abc" and $v_2$ to "bcd". Then, $L(v_1)$ = "abc",
$L(v_2)$ = "bcd". If $e_1 = (v_1, v_2)$ is the edge connecting $v_1$ and $v_2$, then $L(e_1)$ = "abc→bcd", where
→ is the special separator character.

The edges within this work are weighted by the number of co-occurrences of the
vertices' n-grams within the given window in the original documents. For simplicity, for a
vertex $v \in V^G$ we may also write $v \in G$; the same notation may be used for an edge $e \in E^G$,
where we may write $e \in G$.

Given two instances of n-gram graph representation $G_1$, $G_2$, there are a number of operators
that can be applied on $G_1$, $G_2$ to provide the n-gram graph equivalent of union, intersection
and other such operators of set theory. In our summarization task, these operators are useful
as primary tools for all the subtasks (i.e., salience detection, novelty detection, redundancy
removal, query expansion). An example of such an operator is the merging operator of $G_1$ and
$G_2$ corresponding to the union operator in set theory. This operator adds all edges from both
operand graphs to a third one, while making sure no duplicate edges are created. Two edges 
are considered duplicates of each other, when they share identical vertices. We note that the 
definition of identity between vertices can be customized; within our applications two vertices 
are the same if they correspond to the same n-gram. 

The definition of the graph operators is actually non-trivial, because a number of questions 
arise, such as the handling of weights on common edges after a union operator or the 'meaning' 
and thus handling of the zero-weighted edges after the application of any operator. 

All the operators that we shall present operate on edges only, because we consider single 
nodes to be of little value. We argue that information is contained within the relation between 
n-grams and not in the n-grams themselves. Therefore, our minimal unit of interest is the edge, 
which is actually a pair of vertices. 

Overall, we have defined a number of operators, all of which (with the exception of similarity
and containment, which return a value in [0, 1]) are functions from $\mathbb{G} \times \mathbb{G}$ to $\mathbb{G}$,
where $\mathbb{G}$ is the set of valid n-gram graphs of a specific rank
$n$. Thus, operators function upon graphs of a given rank and produce a graph of the same rank.
The operators are the following.

• The similarity function $\mathrm{sim} : \mathbb{G} \times \mathbb{G} \to \mathbb{R}$, which returns a value of similarity between two
n-gram graphs. This function is symmetric, in that $\mathrm{sim}(G_1, G_2) = \mathrm{sim}(G_2, G_1)$. There are
many variations of the similarity function within this study, each fitted to a specific task.
The common ground of these variations is that the similarity values are normalized in the
[0, 1] interval, with higher values indicating higher actual similarity between the graphs.
The computation of similarity is described in subsection 3.1.

• The containment function contains, which indicates what part of a given graph $G_1$ is
contained in a second graph $G_2$. This function is expected to be asymmetric. In other
words, should the function indicate that a graph $G_1$ is contained in another graph $G_2$, we
know nothing about whether the inverse holds.
In the implementations of the containment function proposed herein, result values are
normalized in the [0, 1] interval, with higher values indicating that a bigger part
of a given graph $G_1$ is contained in a second graph $G_2$. We consider a graph to be contained
within another graph if all its edges appear within the latter. If this is not the case, then
any common edge contributes to the overall containment function a percentage inversely
proportional to the size of $G_1$, so that the function value indicates what part of $G_1$ is
contained within $G_2$. The computation of containment is described in subsection 3.1.

• The merging or union operator $\cup$ returns a graph with all the edges, both common and
uncommon, of the two operand graphs. In our implementation of this operator we have
decided that the union operator will set the weight of a common edge equal to the average
of its weights in the corresponding graphs.

• The intersection operator $\cap$ returns a graph with the common edges of the two operand
graphs, with averaged weights. The averaging over edge weights is based on the idea that
the intersection (and the union) of two graphs should be a graph that includes the edges
of both operand graphs, with edge weights as close as possible to both the original graphs.
The union (merge) and intersection operators are presented in section 3.2.

• The delta operator (also called all-not-in operator) $\triangle$ returns the subgraph of a graph
$G_1$ that is not common with a graph $G_2$. This operator is non-symmetric, i.e.,
$G_1 \triangle G_2 \neq G_2 \triangle G_1$, in general.

• The inverse intersection operator $\nabla$ returns all the edges of two graphs that are not
common between the graphs. This operator is symmetric, i.e., $G_1 \nabla G_2 = G_2 \nabla G_1$.

Zero-weighted edges are treated as all other edges, even though zero weight means that the 
edge does not exist (i.e., the vertices are not related). The empty graph is a graph with no 
nodes and no edges. The size of any graph is its edge count and is indicated as $|G_i|$ for a 
graph $G_i$. 

The algorithm we use [Giannakopoulos et al., 2008b] to convert a given string to its character 
n-gram graph representation is quite simple: 

• Extract all character n-grams of rank n from a given text and create graph vertices, one 
for every unique n-gram. The vertices are labeled by their corresponding n-gram. 

• Add edges connecting all n-grams that occur (at least once) within a given distance $D_{win}$ 
of each other in the string. In this work, the weight of the added edge is the number of 
co-occurrences of the corresponding vertices.* 
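
The two steps above can be sketched in Python as follows. This is a minimal illustration and not the reference implementation: the function name `ngram_graph`, the parameter name `d_win` and the forward-only window are assumptions made for the example. A graph is stored as a dictionary from a pair of n-gram labels to an edge weight, so vertices that take part in no edge are not represented.

```python
from collections import Counter

def ngram_graph(text, n=3, d_win=3):
    """Map a string to {(ngram, ngram): weight} for a single rank n."""
    # Every character n-gram of rank n, in order of appearance.
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, gram in enumerate(grams):
        # Assumption: the window looks forward only, so an edge links an
        # n-gram to each of the d_win n-grams that start right after it.
        for neighbour in grams[i + 1:i + 1 + d_win]:
            edges[(gram, neighbour)] += 1  # weight = co-occurrence count
    return dict(edges)

graph = ngram_graph("abcabc", n=2, d_win=1)
```

With these settings the bigrams of "abcabc" are ab, bc, ca, ab, bc, so the edge from ab to bc is seen twice and the other two edges once.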

3.1 Similarity and Containment 

To represent a character sequence or text we can use a set of n-gram graphs, for various n-gram 
ranks, instead of a single n-gram graph. To compare a sequence of characters in a chunk, a 
sentence, a paragraph or a whole document (i.e., in any textual chunk), we apply variations of 
a single algorithm that acts upon the n-gram graph representation of the character sequences. 
The algorithm is actually a similarity measure between two n-gram graph sets corresponding 
to two texts $T_1$ and $T_2$. This similarity can be indicative of the similarity of content of two 
information chunks, in the way any fuzzy string matching technique is. We can therefore apply 
this similarity measurement to determine whether, e.g., a given sentence is related to a user 
query of a summarization task. 

* Note on edge weights: after the application of operators between graphs, the edge weights, 
e.g., of union graphs, may represent average co-occurrences, or other functions of 
co-occurrences, thus not being integer numbers. 

We consider that an n-gram graph maps to a given n-gram rank in this work, i.e., the 
rank $r$ is a parameter of the n-gram graph. Given that the representation of a text is a 
set of graphs $\mathbb{G}_i$, containing graphs of various ranks, and two texts $T_1, T_2$, we use the Value 
Similarity $VS^r$ to compare graphs of the same rank. So, for every n-gram rank $r$ of 
$\mathbb{G}_1, \mathbb{G}_2$ we use the corresponding graph of rank $r$, e.g., $G_1 \in \mathbb{G}_1$ and 
$G_2 \in \mathbb{G}_2$. $VS^r(G_1, G_2)$ measures how many of the edges contained in graph $G_1$ are 
contained in graph $G_2$, considering also the weights of the matching edges. In this measure 
each matching edge $e$ having weight $w_e^1$ in graph $G_1$ contributes 
$\frac{VR(e)}{\max(|G_1|, |G_2|)}$ to the sum, while non-matching edges do not contribute (consider 
that for an edge $e \notin G_1$ we define $w_e^1 = 0$). The ValueRatio (VR) scaling factor is defined as: 

$$VR(e) = \frac{\min(w_e^1, w_e^2)}{\max(w_e^1, w_e^2)} \tag{1}$$

The equation indicates that the ValueRatio takes values in [0, 1], and is symmetric. Thus, the 
full equation for $VS^r$ is: 

$$VS^r(G_1, G_2) = \frac{\sum_{e \in G_1} VR(e)}{\max(|G_1|, |G_2|)} \tag{2}$$

$VS^r$ is a measure converging to 1 for graphs that share edges with similar weights, which means 
that a value of $VS^r = 1$ indicates perfect match between the compared graphs. Another 
important measure is the Normalized Value Similarity (NVS), which is computed as: 

$$NVS^r(G_1, G_2) = \frac{VS^r(G_1, G_2)}{\frac{\min(|G_1|, |G_2|)}{\max(|G_1|, |G_2|)}} \tag{3}$$

The fraction $SS^r(G_1, G_2) = \frac{\min(|G_1|, |G_2|)}{\max(|G_1|, |G_2|)}$ is also called Size Similarity. 
The NVS is a measure of similarity where the ratio of sizes of the two compared graphs does 
not play a role. 

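The overall similarity $VS^O$ of the sets $\mathbb{G}_1, \mathbb{G}_2$ is computed as the weighted sum of the VS over 
all ranks: 

$$VS^O(\mathbb{G}_1, \mathbb{G}_2) = \frac{\sum_{r \in [L_{min}, L_{max}]} r \times VS^r}{\sum_{r \in [L_{min}, L_{max}]} r} \tag{4}$$

where $VS^r$ is the VS measure for extracted graphs of rank $r$ in $\mathbb{G}$, and $L_{min}, L_{max}$ are arbitrarily 
chosen minimum and maximum n-gram ranks. 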

The function contains() realizing the graph containment operator has a small, but significant 
difference from the value similarity function: It is not commutative. More precisely, if we call 
Value Containment (VC) the containment function using edge weights, then VC is: 

$$VC^r(G_1, G_2) = \frac{\sum_{e \in G_1} VR(e)}{|G_1|} \tag{5}$$

The denominator is the cause for the asymmetric nature of the function and makes it corre- 
spond to a graded membership function between two graphs. 
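
The measures of this subsection can be illustrated with a short Python sketch. It is not the reference implementation: graphs are assumed to be dictionaries from edge labels to weights, all function names are hypothetical, and the containment is normalised by the size of the first graph only, which is an assumption based on the description of the containment function as indicating what part of the first graph is contained in the second.

```python
def value_ratio(w1, w2):
    """VR(e): ratio of the smaller to the larger weight, in [0, 1]."""
    top = max(w1, w2)
    return min(w1, w2) / top if top > 0 else 0.0

def matched_value(g1, g2):
    """Sum of VR(e) over the edges of g1 that also appear in g2."""
    return sum(value_ratio(w, g2[e]) for e, w in g1.items() if e in g2)

def value_similarity(g1, g2):
    """VS: matched value divided by the size of the larger graph."""
    if not g1 or not g2:
        return 0.0
    return matched_value(g1, g2) / max(len(g1), len(g2))

def size_similarity(g1, g2):
    """SS: ratio of the smaller to the larger edge count."""
    if not g1 or not g2:
        return 0.0
    return min(len(g1), len(g2)) / max(len(g1), len(g2))

def normalized_value_similarity(g1, g2):
    """NVS = VS / SS, which removes the effect of the size ratio."""
    ss = size_similarity(g1, g2)
    return value_similarity(g1, g2) / ss if ss else 0.0

def value_containment(g1, g2):
    # Assumption: normalise by |G1| only, which makes the measure asymmetric.
    return matched_value(g1, g2) / len(g1) if g1 else 0.0

def overall_similarity(set1, set2):
    """Rank-weighted mean of VS over shared ranks; sets are {rank: graph}."""
    ranks = sorted(set1.keys() & set2.keys())
    if not ranks:
        return 0.0
    return sum(r * value_similarity(set1[r], set2[r]) for r in ranks) / sum(ranks)

g1 = {("a", "b"): 2, ("b", "c"): 1}
g2 = {("a", "b"): 4, ("b", "c"): 1, ("c", "d"): 3, ("d", "e"): 1}
```

In this toy example both edges of `g1` appear in `g2`, one with half the weight, so the similarity is lowered both by the weight mismatch and by the size difference, while the containment of `g1` in `g2` is larger than the containment of `g2` in `g1`.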

3.2 Graph Union (or Merging) and Intersection 

The union, or merge, operator $\cup$ has two important aspects. The first deals with unweighted 
edges as pairs of labeled vertices $e = (v_1, v_2)$, while the second considers the weights of the edges 
as well. 

The merge operator defines the basic operator required to perform updates in the graph 
model. The intersection operator, on the other hand, can be used to determine the common 
subgraphs of different graphs. This use has a variety of applications, such as common topic 
detection in a set of documents (see section 4.3) or the detection of 'stopword-effect' edges. The 
'stopword-effect' edges are edges that appear in the graphs of most texts of a given language 
and have high frequency, much like stopwords. 

The detection of stopword-effect edges in n-gram graphs can be accomplished by simply 
applying the intersection operator upon graphs of (adequately big) texts of various different 
topics. The resulting graph will represent that part of language that has appeared in all the text 
graphs and can be considered 'noise'. More on the notion of noise in relation to n-gram graphs 
can be found in section 3.4.1. 

When performing the union of two graphs we create a graph $G_1 \cup G_2 = G^u = (V^u, E^u, L, W^u)$, 
such that $E^u = E^1 \cup E^2$, where $E^1, E^2$ are the edge sets of $G_1, G_2$ correspondingly. In our 
implementation we consider two edges to be equal, $e_1 = e_2$, when they have the same label, i.e., 
$L(e_1) = L(e_2)$*, which means that the weight is not taken into account when calculating $E^u$. 

To calculate the weights in $G^u$ there can be various functions depending on the effect the 
merge should have over weights of common edges. One can follow the fuzzy logic paradigm and 
keep the maximum of the weights of a given edge in the two source graphs, 
$W^u(e) = \max(W^1(e), W^2(e))$, where $W^1, W^2$ are the weighting functions of the corresponding 
graphs and $e$ is a common edge of $G_1$ and $G_2$. Another alternative would be to keep the average 
of the values so that the weight represents the expected value of the original weights. Given 
these basic alternatives, we chose the average value as the default union operator effect on 
edge weights. It should be noted that the merging operator is a specific case of the graph update 
function presented in section 3.4. Formally, if $E^1, E^2$ are the edge sets of $G_1, G_2$ 
correspondingly, $W^u$ is the result graph edge weighting function and $W^1, W^2$ are the weighting 
functions of the operand graphs with $e \notin E^i \Rightarrow W^i(e) = 0, i \in \{1, 2\}$, then the edge set of the 
merged graph is: 

$$E^u = E^1 \cup E^2, \quad W^u(e) = \frac{W^1(e) + W^2(e)}{2}, \quad e \in (E^1 \cup E^2) \tag{6}$$

The intersection operator $\cap$ returns the common edges between two graphs $G_1, G_2$, performing 
the same averaging operation upon the edges' weights. Formally the edge set $E^i$ of the intersected 
graph is: 

$$E^i = \{e \mid e \in G_1 \wedge e \in G_2\}, \quad W^i(e) = \frac{W^1(e) + W^2(e)}{2}, \quad e \in (E^1 \cap E^2) \tag{7}$$
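
As a hedged illustration, the union and intersection operators can be written in a few lines of Python over graphs stored as dictionaries from edge labels to weights. The function names are hypothetical. The union follows a literal reading of the formal definition, in which an edge missing from one operand counts with weight 0 there, so the weight of a non-common edge is halved; an implementation that keeps non-common weights unchanged would differ on exactly those edges.

```python
def union(g1, g2):
    """All edges of both graphs; each weight is (W1(e) + W2(e)) / 2."""
    # Assumption: an edge absent from a graph has weight 0 in that graph.
    return {e: (g1.get(e, 0.0) + g2.get(e, 0.0)) / 2 for e in g1.keys() | g2.keys()}

def intersection(g1, g2):
    """Only the common edges, with averaged weights."""
    return {e: (g1[e] + g2[e]) / 2 for e in g1.keys() & g2.keys()}

g1 = {("a", "b"): 2.0, ("b", "c"): 4.0}
g2 = {("a", "b"): 4.0, ("c", "d"): 6.0}

merged = union(g1, g2)
common = intersection(g1, g2)
```

Edges are compared by label only, so the edge from a to b is treated as common even though its weight differs between the two graphs.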

3.3 Delta (All-not-in) and Inverse Intersection 

The Delta or all-not-in operator $\triangle$ is a non-commutative operator that, given two graphs 
$G_1, G_2$, returns the subset of edges from the first graph that do not exist in the second graph. 
Formally the edge set $E^{\triangle}$ is: 

$$E^{\triangle} = \{e \mid e \in G_1 \wedge e \notin G_2\} \tag{8}$$

The weight of the remaining edges is not changed when applying the delta operator. 

A similar operator is the inverse intersection operator $\triangledown$, which creates a graph that only 
contains the non-common edges of two graphs. The difference between this operator and the 
delta operator is that in the inverse intersection the edges of the resulting graph may belong to 
any of the original graphs. Formally the resulting edge set $E^{\triangledown}$ is: 

$$E^{\triangledown} = \{e \mid e \in (G_1 \cup G_2) \wedge e \notin (G_1 \cap G_2)\} \tag{9}$$

* We consider the labeling function to be the same over all graphs. 
Both the delta and the inverse intersection operators can be applied to determine the 
differences between graphs. This way one can, e.g., remove a graph representing 'noisy' content 
from another graph. Another application is determining the non-common part of two texts that deal 
with the same subject, which may refer to the unique or novel part of each text, with respect to 
the subject. 
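
A minimal Python sketch of the two difference operators follows, again over graphs stored as dictionaries from edge labels to weights. The function names are hypothetical, and keeping the original weight of each non-common edge in the inverse intersection is an assumption, since the text states this explicitly only for the delta operator.

```python
def delta(g1, g2):
    """Edges of g1 that do not exist in g2, with unchanged weights."""
    return {e: w for e, w in g1.items() if e not in g2}

def inverse_intersection(g1, g2):
    """Edges that belong to exactly one of the two graphs."""
    # Assumption: each edge keeps the weight it has in its source graph.
    result = delta(g1, g2)
    result.update(delta(g2, g1))
    return result

g1 = {("a", "b"): 2.0, ("b", "c"): 4.0}
g2 = {("a", "b"): 4.0, ("c", "d"): 6.0}
```

Here `delta(g1, g2)` and `delta(g2, g1)` return different graphs, whereas the inverse intersection gives the same result for both argument orders.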

3.4 Representing Document Sets - Updatability 

The n-gram graph representation specification indicates how to map a text to an n-gram graph. 
However, in our task it is required to represent a whole document set. The most simplistic way 
to do this using the n-gram graph representation would be to concatenate the documents of the 
set into a single overall document, but this kind of approach would not offer an updatable model 
— i.e., a model that could easily change when a new document enters the set. 

In our applications we have used an update function $U$ that is similar to the merging 
operator, with the exception that the weights of the resulting graph's edges are calculated in a 
different way. The update function $U(G_1, G_2, l)$ takes as input two graphs, one that is considered 
to be the pre-existing graph $G_1$ (i.e., a graph that may have resulted from a sequence of applications 
of the update operator on an initial graph) and one that is considered to be the new graph $G_2$. 
The function also has a parameter called the learning factor $l \in [0, 1]$, which determines the 
sensitivity of $G_1$ to the changes $G_2$ brings. 

Focusing on the weighting function of the graph resulting from the application of $U(G_1, G_2, l)$, 
the higher the value of the learning factor, the higher the impact of the new graph on the existing 
graph. More precisely, the definition of the weighting performed in the graph resulting from $U$ 
is: 

$$W^i(e) = W^1(e) + (W^2(e) - W^1(e)) \times l \tag{10}$$

According to this formula, a value of $l = 0$ indicates that $G_1$ will completely ignore the new 
graph. A value of $l = 1$ indicates that the weights of the edges of $G_1$ will be assigned the values 
of the new graph's edges' weights. A value of 0.5 gives us the simple merging operator. 

The $U$ operator allows using the graph as a representation model for a set of documents. This 
approach is used in our case to represent the common content of source documents. The 
training step for the creation of the content representation model comprises the initialization of 
a common content graph with the representation of the first document and the subsequent 
update of that initial graph with the graphs of the other documents. In particular, when one wants 
the common content graph's edges to hold weights averaging the weights of all the individual 
graphs that have contributed to the common content graph, then the $i$-th new graph that updates 
the common content graph should use a learning factor of 

$$l = 1.0 - \frac{i - 1}{i}, \quad i > 1 \tag{11}$$

This methodology creates a common content graph that can function as a representative graph 
for all the source documents, in that we expect it to be as 'close' as can be to the individual 
graphs of the individual documents, in terms of value similarity. 
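
The update function and the averaging schedule for the learning factor can be sketched as follows. This is an illustrative Python fragment with hypothetical function names; it assumes graphs stored as dictionaries from edge labels to weights and treats an edge missing from a graph as having weight 0 there.

```python
def update(g1, g2, l):
    """U(G1, G2, l): move each weight of g1 towards g2 by the learning factor l."""
    # Assumption: an edge absent from a graph has weight 0 in that graph.
    return {
        e: g1.get(e, 0.0) + (g2.get(e, 0.0) - g1.get(e, 0.0)) * l
        for e in g1.keys() | g2.keys()
    }

def common_content_graph(graphs):
    """Fold document graphs into one model whose weights are running averages."""
    model = dict(graphs[0])
    for i, graph in enumerate(graphs[1:], start=2):
        # The i-th graph uses l = 1 - (i - 1) / i, i.e. 1 / i.
        model = update(model, graph, 1.0 - (i - 1) / i)
    return model

docs = [{("a", "b"): 1.0}, {("a", "b"): 2.0}, {("a", "b"): 6.0}]
model = common_content_graph(docs)
```

With edge weights 1, 2 and 6 the successive updates give 1, then 1.5, then 3, which is the mean of the three weights; this is why the learning factor shrinks as more documents are folded in.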

When the common content graph is created, one can determine whether a new document 
is similar to the content of the source documents by measuring the similarity of the document 
graph to the common content graph. This information has been used to determine the salience of 
information chunks in section 4.3. We have further used the update operator to determine noisy 
information, in terms of useless graph edges. This is illustrated in the following paragraphs. 
3.4.1 Determining Noise Using the N-gram Graph Operators 

When determining the common content graph, one faces the presence of noise within the graph. 
The noise within the graphs for a classification task would be the set of common subgraphs 
over all classes of documents, as they do not offer distinctive information. Similarly, in our 
summarization task, we consider that the part of the common content graph that would appear, 
no matter the underlying topic of the sources, is noise. In traditional text classification techniques 
stopword removal is usually used to remove what is considered to be the noisy part of the data. 
Up to this point we have seen that n-gram graphs do not need such preprocessing. However, 
based on the task, we can usually identify non-interesting parts of data that hinder the task. 
This 'noise' can be removed via the already proposed n-gram graph algorithms. 

In order to see what we should do to remove the noise, we will use the related paradigm of 
a classification task. For the classification task, we assume that we have a set of training 
documents for a number of classes. We consider as noise the part of the information in the graphs 
that does not help determine the class of a document. If we manage to find this information, as 
represented in the graphs, we will be able to remove the noise from the common content graph. 

In the case of a classification task, we create a merged graph — using the update operator 
— for the full set of training documents Tc belonging to each class c. After creating the classes' 
model graphs, one can determine the maximum common subgraph between classes and remove 
it to improve the distinction between different classes. A number of questions arise given this 
train of thought: 

• How can one determine the maximum common subgraph between classes? According to 
our operators, it is easy to see that the maximum common subgraph is the intersection 
of all the class graphs. In other words, the same operator that is used to determine the 
common content within a class of documents, becomes useful as noise indicator when doing 
inter-class analysis. 

• Is this (sub)graph unique? No, it is not. Even though the intersection operator is associative 
if edge weights are omitted, the averaging of edge weights per intersection causes 
non-associativity. In other words, $(G_1 \cap G_2) \cap G_3 \neq G_1 \cap (G_2 \cap G_3)$, due to the 
calculations of the edge weights. 

• Can we approximate the noisy subgraph, without iterating through all the classes? Yes. 
As can be seen in Figure 1 the noisy subgraph can be easily approximated in very few 
steps. In the figure the horizontal axis indicates the number of consecutive intersections 
performed between the classes' graphs and the vertical axis illustrates the number of edges 
of the resulting intersected graph. It shows that even from the third iteration there is only 
insignificant change in the resulting graph size. 

• Does the removal of the noisy (sub)graph from each class graph really improve results? The 
answer is yes, as we will immediately show. 
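
The first three answers above can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: graphs are dictionaries from edge labels to weights, the function names and the `steps` parameter are hypothetical, and the toy class graphs are invented for the example.

```python
def intersection(g1, g2):
    """Common edges of two graphs, with averaged weights."""
    return {e: (g1[e] + g2[e]) / 2 for e in g1.keys() & g2.keys()}

def approximate_noise(class_graphs, steps=None):
    """Intersect class graphs one after another; also report the size per step."""
    rest = class_graphs[1:] if steps is None else class_graphs[1:1 + steps]
    noise, sizes = dict(class_graphs[0]), []
    for graph in rest:
        noise = intersection(noise, graph)
        sizes.append(len(noise))  # edge count after this intersection
    return noise, sizes

c1 = {("t", "h"): 2.0, ("h", "e"): 4.0, ("a", "b"): 1.0}
c2 = {("t", "h"): 4.0, ("h", "e"): 4.0, ("c", "d"): 1.0}
c3 = {("t", "h"): 8.0, ("x", "y"): 1.0}

noise, sizes = approximate_noise([c1, c2, c3])
left = intersection(intersection(c1, c2), c3)   # (c1 ∩ c2) ∩ c3
right = intersection(c1, intersection(c2, c3))  # c1 ∩ (c2 ∩ c3)
```

Both groupings keep the same single edge, but its weight is 5.5 in `left` and 4.0 in `right`, which shows the non-associativity caused by averaging. Passing a small `steps` value stops the chain early, in the spirit of approximating the noisy subgraph without visiting every class.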

To support our intuition that there is a common part of the n-gram graphs whose effect is 
the same as that of noise (i.e., what we have called stopword-effect edges), we performed 
a set of experiments on classification, which can be easily related to performing topic detection 
over a set of topics, or to performing common content extraction in the analysis step of our 
summarization methodology. We created a topic graph for each of the 48 TAC 2008* topics, 
based on the documents contained within each topic. Then, we tried to classify each of these 
documents to a corresponding topic. 

* See the TAC site, at http://www.nist.gov/tac, for more. 
Figure 1: Convergence of common graph size over the number of intersections 



The classification was performed by measuring the similarity of a judged document to a set 
of topic-representative graphs. The topic of the graph with the maximum similarity was selected 
as the topic of the document. This way, we wanted to see if all documents would be found to 
be maximally similar to their topic graphs, since the training instances would be expected to be 
recognized as belonging to their original topic. If that was not the case, then perhaps this would 
be the effect of common elements between the content of different topics, i.e., noise. As an 
indication of the performance of the classification we use the recall value, i.e., the number 
of documents that were correctly indicated as belonging to a given class, divided by the 
number of all documents that belong to the class. The histogram of recall values for all classes 
(topics) before removing the noise is shown in figure 2a. The recall histogram after removing the 
alleged noise can be seen in figure 2b. The y-axis indicates the number of classes for which the 
recall value was within a given range, and the x-axis indicates the recall value ranges. 
Ideally, all the values should be 1.0, to indicate perfect classification. The figures 
show that several class results move towards higher recall values after the noise is removed. 

To see whether the results illustrate statistically significant improvement, we used a paired 
Wilcoxon ranked sum test [Wilcoxon, 1945] , because of the non-normal distribution of differences 
between the recall of each class in the noisy and noise-free tests. The test indicated that within 
the 99% statistical confidence level (p-value < 0.001) the results when removing the noise were 
indeed improved. The conclusion of this experiment is that the removal of noise can really 
help determine relation to a topic using the n-gram graph representation. This advances the 
effectiveness of the content selection process when a noise-free topic graph is available. Therefore, 
in our content analysis of the original graphs, we make use of the noise removal process to keep 
a noise-free graph. 
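
The classification procedure used in this experiment can be sketched as follows. The fragment is illustrative only: graphs are dictionaries from edge labels to weights, the function names are hypothetical, the toy topic graphs are invented, and value similarity is assumed as the similarity measure.

```python
def ratio(w1, w2):
    """Ratio of the smaller to the larger weight, in [0, 1]."""
    top = max(w1, w2)
    return min(w1, w2) / top if top > 0 else 0.0

def value_similarity(g1, g2):
    """Sum of weight ratios over shared edges, divided by the larger size."""
    if not g1 or not g2:
        return 0.0
    shared = g1.keys() & g2.keys()
    return sum(ratio(g1[e], g2[e]) for e in shared) / max(len(g1), len(g2))

def delta(g1, g2):
    """Edges of g1 that do not exist in g2."""
    return {e: w for e, w in g1.items() if e not in g2}

def classify(doc_graph, topic_graphs):
    """Return the name of the topic graph most similar to the document graph."""
    return max(topic_graphs, key=lambda name: value_similarity(doc_graph, topic_graphs[name]))

topics = {
    "sports": {("t", "h"): 3.0, ("g", "o"): 2.0},
    "finance": {("t", "h"): 3.0, ("b", "a"): 2.0},
}
noise = {("t", "h"): 3.0}  # edge shared by every topic graph
cleaned = {name: delta(graph, noise) for name, graph in topics.items()}
doc = {("t", "h"): 3.0, ("g", "o"): 2.0}
label = classify(doc, cleaned)
```

After removing the shared edge, the document matches only the topic that contains its distinctive edge, so the similarity to the other topic drops to zero.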
Figure 2: Recall values histogram for all the classes used in the classification task with and 
without noise. 



4 The Proposed Methodology and System: MUDOS-NG 

In this section we provide an overview of the MUDOS-NG system (also see Figure 3), as well as 
an in-depth analysis of the proposed individual steps. 

The analysis of source documents' content step gets as input a set of source documents 
from which it extracts and represents, through the n-gram graph representation of documents 
and operators such as the intersection, the information that is common. This information is 
considered to be important, due to its presence in all the source documents. The output of the 
step is the common content graph (see more in section 4.1), representing the common part of 
the source documents. 

The query expansion step, which is an optional step, aims to annotate a user-supplied 
query sentence, indicating the subject of the requested summary, with a set of concepts so as to 
expand it. The input of this step is the query, which is expected to be free-text, and the output 
is a graph representing the expanded version of the query. 

The candidate content grading step assigns scores to sentences from the source documents, 
in order to evaluate the salience of each sentence. The grading takes into account the user query as 
well as the information common to all source documents and outputs an ordered list of sentences. 

The next step uses either a redundancy removal or a novelty detection approach to avoid 
repetition in the output summary. This step either eliminates or reranks the already graded and 
sorted candidate summary sentences, in order to avoid or penalize repetition of information. This 
step is in fact integrated in the summary creation (final) step. Its input is the set of candidate 
sentences and its output is a (possibly reranked) subset of the candidate sentences. 

Given the output of the previous steps, the final step creates the summary as a sequence of 
selected candidate sentences. We have not used any sentence ordering method to improve the 
output of the system, as our main purpose was to determine whether the n-gram graph based 
tools we devised can be useful throughout the summarization process. This system aims to 
provide core methods exploiting the n-gram graph representation, providing the basis for more 
advanced summarization systems. Experiments indicate that indeed, even without the sentence 
ordering or any rewrite methodology, our system provides promising results. 

Figure 3: MUDOS-NG System Overview 

4.1 Analysis of Source Documents' Content 

To analyze the source documents we need to be able to identify and represent the minimal units 
of information. In other approaches this minimal unit of information would be the word, but we 
need to remain language independent and also take into account the fact that a word can also be 
split in sub-word parts. To do this we use the next-character entropy chunking (section 4.1.1). 
Each chunk, which may be part of a word or longer than a word (e.g., part of a collocation) 
is then represented by its corresponding n-gram graph. To identify whether this splitting of the 
sentence into smaller parts makes any difference in judging the salience of a sentence, we perform 
experiments (see section 5) taking into account two alternatives: creating a graph per sentence 
versus creating a graph per chunk. In the following subsection, we describe the solutions we have 
applied to identify these chunks. 

4.1.1 Next-character Entropy Chunking 

To determine the information chunks that will be used in the steps leading to the final summary, 
we need to determine the appropriate delimiters. To do this in a language-neutral way, we exploit 
our document corpus to determine the probability $P(c|S_n)$ that a single given character $c$ will 
follow a given character n-gram $S_n$, for every character $c$ in the corpus. The probabilities can 
then be used to calculate the entropy of the next character for a given character n-gram $S_n$, as 
follows. If $c_i, i \in \mathbb{N}, 0 < i \le M$ is the set of characters that have been found to follow the given 
n-gram $S_n$ and $f_i$ the frequency (count) of $c_i$ being found after $S_n$, then 

$$P_i \equiv P(c_i|S_n) = \frac{f_i}{\sum_{j=1}^{M} f_j} \tag{12}$$

The next-character entropy $H(S_n)$ of $S_n$ in the document corpus is given by: 

$$H(S_n) = -\sum_{i=1}^{M} P_i \log P_i \tag{13}$$

The entropy measure indicates uncertainty. We have supposed that substrings of a character 
sequence where the entropy of $P(c|S_n)$ surpassed a statistically computed threshold represent 
candidate delimiters. The threshold is based on the analysis of entropy values that are illustrated 
in Figure 4 on the left. We noted that one can detect three fuzzy regions in the entropy graph: 
the first is the region containing delimiters, the second is the region containing non-delimiters 
and the third contains symbols that have very low entropy of next character, i.e., they are part 
of common syllables. The regions are defined by non-trivial changes in the curve of the entropy 
measure. 

To detect the most prominent changes we measured the delta of the entropy values for 
successive symbols in the graph. The delta $D_H$ is the absolute value of the change in entropy 
between two successive symbols and is illustrated in Figure 4 on the right. In both figures the 
horizontal line parallel to the Symbol axis indicates the mean value of entropy and of entropy 
delta, respectively. Given the ordered set $H = (H_1, ..., H_M)$ of values $H(S_n)$ for 
$S_n \in \mathbb{S}$, where $\mathbb{S}$ is the set of n-grams in a corpus, the delta value $D_H(H_k)$ of an n-gram 
with entropy value $H_k$ is defined as: 

$$D_H(H_k) = |H_{k+1} - H_k|, \quad k < M \tag{14}$$

where $|x|$ is the absolute value operator. 

Figure 4: Entropy per Symbol (left) and Delta of Entropy per Symbol (right) - Ordered Descending. 
[Left panel, "Per Symbol Entropy": entropy over the Symbol axis, with regions labeled Delimiters, 
Normal and Syllable Parts. Right panel, "Per Symbol Entropy Delta": entropy delta over the Symbol 
axis, with the Delimiter Threshold marked.] 

The entropy value corresponding to the local maximum of the entropy delta in the right half 
of the symbol-entropy delta function is selected as the threshold (depicted by a dark circle in the 
figure) for determining delimiter characters. This happens because we consider exactly three 
areas, as is depicted in the left part of figure 4, and we expect their split points to be on either 
side of the middle symbol in the delta-ordered list of symbols. Formally, the threshold $D_{H,0}$ is 
given by: 

$$D_{H,0} = \arg\max_{H_k,\ \lfloor M/2 \rfloor < k < M} D_H(H_k) \tag{15}$$

where $\lfloor x \rfloor$ is the floor operator, returning the closest integer lower than or equal to $x$. 

In our application we have only checked for unigrams, i.e., single characters, as delimiters, 
even though delimiters of higher rank can be determined. For example, in bi-gram chunking 
the sequence ", " (comma and space) would be considered to be a delimiter, while in unigrams the 
space character only would be considered a delimiter. Given a character sequence $S_n$ and a set of 
delimiters $D$, our chunking algorithm splits the string after every occurrence of a delimiter $d \in D$. 
This way a sentence is split into a set of chunks that can then be assigned salience during the 
content selection process. 
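
The chunking step can be sketched in Python as below. The fragment is illustrative: all names and the entropy values are hypothetical, only single-character delimiters are handled, and the threshold is passed in by hand rather than derived from the entropy delta curve.

```python
def entropy_deltas(entropies):
    """Absolute change between successive entropy values, sorted descending."""
    ordered = sorted(entropies, reverse=True)
    return [abs(nxt - cur) for cur, nxt in zip(ordered, ordered[1:])]

def delimiters_above(entropy_by_symbol, threshold):
    """Symbols whose next-character entropy surpasses the threshold."""
    return {s for s, h in entropy_by_symbol.items() if h > threshold}

def chunk(text, delimiters):
    """Split text after every occurrence of a single-character delimiter."""
    chunks, current = [], ""
    for ch in text:
        current += ch
        if ch in delimiters:
            chunks.append(current)
            current = ""
    if current:
        chunks.append(current)  # trailing text without a closing delimiter
    return chunks

entropy = {" ": 4.0, ",": 3.5, "e": 2.0, "q": 0.5}  # hypothetical values
delims = delimiters_above(entropy, threshold=3.0)    # threshold chosen by hand
pieces = chunk("a cat, sat", delims)
```

Because the split happens after each delimiter, every delimiter stays attached to the end of the chunk that precedes it, and a delimiter that directly follows another one forms a chunk of its own.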



4.2 Query Expansion 

Summarization systems like the ones presented within the DUC and TAC communities need to 
be able to respond to specific queries. In information retrieval, for some time now, the use of 
query expansion has been shown to be useful at times. We wanted to try using query expansion 
in this summarization system as well to determine whether it offers improvement to the system 
performance. 

To perform query expansion we use a two-step process. First, we devise a methodology that 
can map a sentence to a set of concepts, given external knowledge. Once again, our analysis 
of the sentence tries to remain as language-independent as possible, even though the use of the 
external resource may dictate a specific language. The second step for the query expansion is 
the use of the concept descriptions mapped to the query sentence, in order to expand the graph 
representation of the query. We elaborate on these two steps in the following sections. 

4.2.1 Mapping a Sentence to a Set of Concepts Using External Knowledge 

Within the scope of our work we tried to map sentences to concepts (i.e., to terms with 
well defined meaning). Usually this happens by looking up words in thesauri (e.g., 
WordNet [Miller et al., 1990]). In our approach we have used a decomposition module based on the 
notion of the symbolic graph. 

A symbolic graph is a graph where each vertex contains a string and edges connect 
vertices in a way indicative of a substring relation. As an example, if we have two strings abc, ab 
labeling two corresponding vertices, then since ab is a substring of abc there will be a directed 
edge connecting ab to abc*. In general, the symbolic graph of a given text T contains every 
string in T and for every string it illustrates the set of substrings that compose it. 
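
A symbolic graph of this kind can be sketched in Python as follows. The fragment is illustrative and makes two assumptions that are not stated in the text: substrings are capped at a hypothetical `max_len`, and an edge links a string only to the strings that extend it by exactly one character.

```python
def symbolic_graph(text, max_len=3):
    """Return (vertices, edges); an edge (s, t) means s is a substring of t."""
    vertices = {
        text[i:i + k]
        for k in range(1, max_len + 1)
        for i in range(len(text) - k + 1)
    }
    edges = set()
    for v in vertices:
        if len(v) > 1:
            # Assumption: link only the two substrings that are one character shorter.
            edges.add((v[:-1], v))
            edges.add((v[1:], v))
    return vertices, edges

vertices, edges = symbolic_graph("abc", max_len=3)
```

For the text "abc" this yields the six vertices a, b, c, ab, bc and abc, with directed edges such as the one from ab to abc.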

When a symbolic graph has been constructed, one can run through all the vertices of the 
graph and look up each vertex in a thesaurus to determine if there is a match with an entry. If 
the thesaurus contains an entity lexicalized by the vertex string, then the vertex is assigned the 
corresponding term meaning. In the case of polysemous terms, the vertex is annotated with 
all possible meanings of the term. This annotated graph, together with a facility that supports 
comparing meanings, is what we call the semantic index. 

The semantic index, therefore, represents links between n-grams and their semantic 
counterparts, implemented as, e.g., WordNet definitions, which are textual descriptions of the possible 
senses. Such definitions are used within Example 4.1. If $D_1, D_2$ are the sets of definitions of 
two terms $t_1, t_2$, then to compare the semantics (meaning) $m_1, m_2$ of $t_1, t_2$ using the semantic 
index, we actually compare the n-gram graph representations $G_{1i}, G_{2j}$, $1 \le i \le |D_1|$, 
$1 \le j \le |D_2|$ of each pair of definitions of the given terms. Within this section we consider the 
meaning of a term to map directly to the set of possible senses the term has. The relatedness of 
meaning $rel_{Meaning}$ is considered to be the averaged sum of the similarities over all pairs of 
definitions of the compared terms when represented as n-gram graphs: 

$$rel_{Meaning}(t_1, t_2) = \frac{\sum_{G_{1i}, G_{2j}} sim(G_{1i}, G_{2j})}{|D_1| \times |D_2|} \tag{16}$$

This use of relatedness implies that uncertainty concerning the actual meaning of terms is 
handled within the measure itself, because many alternative senses, i.e., high $|D_1|, |D_2|$, will 
cause a lower result of relatedness. An alternative version of the relatedness measure, which only 
requires a single pair to be similar to determine high relatedness of the meanings, is the following: 

$$rel_{Meaning}'(t_1, t_2) = \max_{G_{1i}, G_{2j}} sim(G_{1i}, G_{2j}) \tag{17}$$

Within our examples in this section we have used equation 16. 
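
The two relatedness variants can be illustrated with the following Python sketch. All names are hypothetical, both definition lists are assumed to be non-empty, and the similarity function is passed in as a parameter. The `word_overlap` function used in the example is only a stand-in so that the fragment is self-contained; it is not the n-gram graph similarity.

```python
def rel_meaning(defs1, defs2, sim):
    """Average similarity over all pairs of definitions."""
    total = sum(sim(d1, d2) for d1 in defs1 for d2 in defs2)
    return total / (len(defs1) * len(defs2))

def rel_meaning_max(defs1, defs2, sim):
    """Similarity of the best-matching pair of definitions."""
    return max(sim(d1, d2) for d1 in defs1 for d2 in defs2)

def word_overlap(a, b):
    # Stand-in similarity (Jaccard overlap of words), NOT the n-gram graph measure.
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

d1 = ["quick mind"]
d2 = ["quick wit", "stylish dress"]

average = rel_meaning(d1, d2, word_overlap)
best = rel_meaning_max(d1, d2, word_overlap)
```

The second term has an unrelated extra sense, which halves the averaged relatedness but leaves the maximum-based variant unchanged.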

Example 4.1 Compare: smart, clever 
WordNet sense definitions for 'clever': 

• cagey, cagy, canny, clever - (showing self-interest and shrewdness in dealing with others; "a cagey lawyer"; 
"too clever to be sound") 

* The symbolic graph can also be represented more efficiently as a trie [Fredkin, 1960]. 

Table 1: Other examples of comparisons in ascending relatedness value: using sense overview 

• apt, clever - (mentally quick and resourceful; "an apt pupil": "you are a clever man. ..you reason well and 
your wit is bold"-Bram Stoker) 

• clever, cunning, ingenious - (showing inventiveness and skill; "a clever gadget"; "the cunning maneuvers 
leading to his success"; "an ingenious solution to the problem") 

WordNet sense definitions for 'smart': 

• smart - (showing mental alertness and calculation and resourcefulness) 

• chic, smart, voguish - (elegant and stylish; "chic elegance"; "a smart new dress"; "a suit of voguish cut") 

• bright, smart - ( characterized by quickness and ease in learning; "some children are brighter in one subject 
than another" ; "smart children talk earlier than the average" ) 

• fresh, impertinent, impudent, overbold, smart, saucy, sassy, wise - (improperly forward or bold; "don't be fresh 
with me"; "impertinent of a child to lecture a grownup" ; "an impudent boy given to insulting strangers" ; "Don't 
get wise with me!") 

• smart - (painfully severe; "he gave the dog a smart blow") 

• smart - (quick and brisk; "I gave him a smart salute" ; "we walked at a smart pace") 

• smart - (capable of independent and apparently intelligent action; "smart weapons") 
Relatedness of meaning ($rel_{Meaning}$): 0.0794 

In table 1, we present some more pairs of terms and their corresponding relatedness values. 
These preliminary results indicate that, even though the measure appears to have higher values 
for terms with similar meaning, it may be biased when two words have similar spelling. This 
happens because the words themselves appear in the definitions, causing a partial match between 
otherwise different definitions. 

The results further depend heavily on the textual description — i.e., definition — of any 
term's individual sense (synset in WordNet). The results from the examples when using synonyms 
as descriptors of individual senses can be seen in table 2. We notice that e.g., the words 'smart' 
and 'clever' are found to have no relation whatsoever, because no common synonyms are found 
within WordNet. Furthermore, since a given word always appears in its own synonym list, word 
spelling similarity still plays an important role when judging relatedness between senses of two 
different words. For example, the words 'hollow' and 'holler' have an important overlap in terms 
of spelling. This makes some of their senses have high overlap, which then biases the comparison 
process to consider the corresponding senses related. 

Table 2: Other examples of comparisons in ascending relatedness value, using only synonyms
from the overview

The use of a semantic index is that of a meaning look-up engine. We remind the reader that
the semantic index is actually an annotated symbolic graph. If there is no matching vertex in
the graph to provide a meaning for a given input string, then the string is considered to have the
meaning of its closest, in terms of graph path length, substrings that have been given a meaning.
This 'inheritance' of meaning from short to longer strings is actually based on the intuition
that a text chunk contains the meaning of its individual parts. Furthermore, a word may be
broken down to elementary constituents that offer meaning. If one uses an ontology or even a
thesaurus including prefixes, suffixes or elementary morphemes and their meanings to annotate
the symbolic graph, then the resulting index becomes a powerful semantic annotation engine,
in that it can combine the meaning of sub-word items to determine the word meaning. On the
other hand, since composite words or collocations are not necessarily the simple composition of
the meaning of their parts, in several cases we expect the resulting meaning annotation to be
false.

In the context of this work, we have combined a symbol graph and WordNet into a semantic 
index to annotate queries with meanings and perform query expansion, knowing that the process 
will be non-optimal. 
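
The look-up behaviour of the semantic index can be illustrated with a minimal sketch. It is not the
MUDOS-NG implementation: the index is a plain dictionary (a hypothetical stand-in for the annotated
symbolic graph), and closeness 'in terms of graph path length' is approximated, as an assumption, by
preferring the longest annotated substrings of the input string.

```python
def lookup_meaning(text, index):
    """Return the meanings attached to `text`, inheriting from substrings if needed.

    `index` is a hypothetical dict mapping annotated strings to meanings.
    """
    if text in index:
        return [index[text]]
    # All (start, end) spans of `text` that carry an annotation.
    spans = [(i, j)
             for i in range(len(text))
             for j in range(i + 1, len(text) + 1)
             if text[i:j] in index]
    # Keep only spans that are not contained in a longer annotated span.
    maximal = [(i, j) for (i, j) in spans
               if not any(a <= i and j <= b and (a, b) != (i, j) for (a, b) in spans)]
    return [index[text[i:j]] for (i, j) in maximal]


toy_index = {"un": "not", "hap": "luck", "happy": "feeling joy"}
print(lookup_meaning("unhappy", toy_index))
```

With the toy index above, 'unhappy' has no entry of its own and inherits the meanings of 'un' and
'happy', while the shorter 'hap' is ignored because it lies inside 'happy'.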

4.2.2 Query Expansion Through Graph Merging 

Query expansion is based on the assumption that a set of words related to an original query can 
be used as part of the query itself to improve the set of the returned results. In the literature 
much work has indicated that query expansion should be carefully applied in order to improve 
results [Voorhees, 1994, Qiu and Frei, 1993]. 

In our approach, we have used a semantic index based on WordNet's 'overview of senses'.
An example of such an 'overview of senses' for the words 'test' and 'ambiguous' can be seen in
Example 4.2.

Example 4.2 Overview of verb test 

The verb test has 7 senses (first 3 from tagged texts) 

1. (32) test, prove, try, try out, examine, essay

— (put to the test, as for its quality, 
or give experimental use to; 

"This approach has been tried with good results"; 
"Test this recipe") 

2. (9) screen, test

— (test or examine for the presence of disease or 
infection; "screen the blood for the HIV virus") 

3. (4) quiz, test — (examine someone's knowledge of something; 
"The teacher tests us every week"; 

"We got quizzed on French irregular verbs") 

4. test — (show a certain characteristic when tested; 
"He tested positive for HIV") 

5. test — (achieve a certain score or rating on a test; 
"She tested high on the LSAT and was admitted to 

all the good law schools") 

6. test — (determine the presence or properties of (a substance)) 

7. test — (undergo a test; "She doesn't test well") 

Overview of adj ambiguous 

The adj ambiguous has 3 senses (first 3 from tagged texts) 

1. (9) equivocal, ambiguous — (open to two or more interpretations; 
or of uncertain nature or significance; 

or (often) intended to mislead; "an equivocal statement";
"the polling had a complex and equivocal (or ambiguous) message
for potential female candidates";

"the officer's equivocal behavior increased the victim's uneasiness"; 
"popularity is an equivocal crown"; 

"an equivocal response to an embarrassing question") 

2. (4) ambiguous — (having more than one possible meaning: 
"ambiguous words"; "frustrated by ambiguous instructions, 
the parents were unable to assemble the toy") 

3. (1) ambiguous — (having no intrinsic or objective meaning; 
not organized in conventional patterns; 

"an ambiguous situation with no frame of reference" ; 
"ambiguous inkblots") 

For a given word $w$ in the original query, the overview of senses is returned by the semantic
index; from these senses $S_i$, $i > 0$, we only utilize senses $S_j$ with graph representations
$G_{S_j}$ whose (graph) similarity to the common content graph $C_U$ is greater than a given
threshold $t$ (see section 3.1 for the definitions of the used functions):

$$G_{S_j} \cap C_U \neq \emptyset \quad \text{and} \quad VS^{O}(G_{S_j}, C_U) > t, \quad t \in \mathbb{R}^{+}. \tag{18}$$

We remind the reader that the content graph is the intersection of all the graph representations 
of the texts in the set. 

Finally, the query is expanded by merging the representation of the original query $G_q$ and the
representations $G_{S_j}$ of all the $j$ senses that have been 'filtered' according to the procedure,
giving a new query-based content definition $C_U'$. Having calculated $C_U'$, we can judge the
importance of sentences simply by comparing the graph representation of each sentence to $C_U'$.
We refer to the removal of the 'irrelevant' definitions from the overview of senses as the sense filter.
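
The sense filter and the merging step can be sketched as follows. This is a minimal illustration
under stated assumptions, not the system's code: the graph construction (character 3-grams,
neighbourhood window of 3), the value similarity and the merging operator are simplified stand-ins
for the functions of section 3.1, and all function names and the default threshold are hypothetical.

```python
from collections import Counter


def ngram_graph(text, n=3, window=3):
    """Character n-gram graph as {(gram, later_gram): co-occurrence count}."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, gram in enumerate(grams):
        for neighbour in grams[i + 1:i + 1 + window]:
            edges[(gram, neighbour)] += 1
    return dict(edges)


def value_similarity(g1, g2):
    """Assumed value similarity: weight ratios of common edges over the larger graph size."""
    if not g1 or not g2:
        return 0.0
    common = g1.keys() & g2.keys()
    total = sum(min(g1[e], g2[e]) / max(g1[e], g2[e]) for e in common)
    return total / max(len(g1), len(g2))


def merge_graphs(g1, g2):
    """Assumed merging operator: union of edges, averaging weights of shared edges."""
    merged = dict(g1)
    for edge, weight in g2.items():
        merged[edge] = (merged[edge] + weight) / 2 if edge in merged else weight
    return merged


def expand_query(query, sense_texts, content, t=0.1):
    """Merge into the query graph only the senses that pass the sense filter."""
    expanded = ngram_graph(query)
    for text in sense_texts:
        g_sense = ngram_graph(text)
        shares_edges = bool(g_sense.keys() & content.keys())
        if shares_edges and value_similarity(g_sense, content) > t:
            expanded = merge_graphs(expanded, g_sense)
    return expanded


content = ngram_graph("screen the blood for the virus")
expanded = expand_query("blood test", ["screen the blood", "zzzz qqqq"], content)
```

In this toy run the sense text that overlaps with the content graph is merged into the query
representation, while the unrelated one is dropped by the filter.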

Even though the query expansion process was ultimately rather successful, our first attempt
at query expansion added noise, because chunks like 'an', 'in' and 'o' were directly
assigned the WordNet meanings of 'angstrom', 'inch' and 'oxygen' respectively. This lowered
the evaluation scores of our submitted runs in TAC 2008 [Giannakopoulos et al., 2008a]. Using
the sense filter, the deficiency has been tackled in the current version of the system, significantly
improving overall performance.

4.3 The Content Selection Process 

Concerning the content matching part of the presented summarization system, the following 
basic assumptions have been made. 

• The content $C_U$ of a text set (corpus) $U$ is considered to be the noise-free intersection
of all the graph representations of the texts in the set: $C_U = \bigcap_{t \in U} G_t$, where $G_t$ is the
graph representation of text $t$, over all the (arbitrarily selected) n-gram ranges. In other
words, we consider that the common part of the graph representations of all the texts in a
topic indicates the common content of the texts. Through the (optional) query expansion
process one can finally use, instead of $C_U$, the query-based content definition $C_U'$, which is
determined as described in the previous section.

• A sentence $S$ is considered more similar to the content $C_U$ of a text set, the more of the
sentence's chunks (sub-strings of a sentence) have an n-gram graph representation similar
to the corresponding content representation. Every chunk's similarity to the content is
added to the overall similarity of a sentence to the content. The chunks of a sentence
are extracted using the aforementioned entropy-based approach (section 4.1.1). As we will
see in the experiments in section 5, we test whether the use of chunks differentiates the
performance of the system as opposed to the case where each sentence is considered a single
chunk.
According to the chunking process, each sentence is assigned a score, which is actually the
sum of the Normalized Value Similarities (see Eq. 3) of its chunks to the content. This process
offers an ordered list of sentences L. A naive selection algorithm would then select the highest-
scoring sentences from the list, until the summary word count limit is reached. However, this
would not take redundancy into account and, thus, this is where redundancy removal comes in.
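
The content definition and the sentence scoring can be sketched as follows. The sketch rests on
assumptions: graphs use character 3-grams with a window of 3, the intersection averages the weights
of shared edges, noise removal is omitted, and the entropy-based chunker of section 4.1.1 is replaced
by a pluggable `chunker` argument (hypothetical name) that defaults to treating the whole sentence as
a single chunk.

```python
from collections import Counter


def ngram_graph(text, n=3, window=3):
    """Character n-gram graph as {(gram, later_gram): co-occurrence count}."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, gram in enumerate(grams):
        for neighbour in grams[i + 1:i + 1 + window]:
            edges[(gram, neighbour)] += 1
    return dict(edges)


def intersect(g1, g2):
    """Assumed intersection operator: shared edges with averaged weights."""
    return {e: (g1[e] + g2[e]) / 2 for e in g1.keys() & g2.keys()}


def content_graph(texts):
    """Content of a text set as the intersection of all text graphs (no noise removal)."""
    graphs = [ngram_graph(t) for t in texts]
    content = graphs[0]
    for graph in graphs[1:]:
        content = intersect(content, graph)
    return content


def normalized_value_similarity(g1, g2):
    """Value similarity divided by size similarity, i.e. normalised by the smaller graph."""
    if not g1 or not g2:
        return 0.0
    common = g1.keys() & g2.keys()
    total = sum(min(g1[e], g2[e]) / max(g1[e], g2[e]) for e in common)
    return total / min(len(g1), len(g2))


def score_sentence(sentence, content, chunker=None):
    """Sum the similarities of the sentence's chunks to the content graph."""
    chunks = chunker(sentence) if chunker else [sentence]
    return sum(normalized_value_similarity(ngram_graph(c), content) for c in chunks)


def rank_sentences(sentences, content, chunker=None):
    """Ordered list L: sentences by decreasing score."""
    return sorted(sentences, key=lambda s: score_sentence(s, content, chunker), reverse=True)


content = content_graph(["ice is melting in Antarctica", "Antarctica ice melting fast"])
ranked = rank_sentences(["zzzz qqqq", "Antarctica warms"], content)
```

Passing, for instance, `chunker=str.split` would score word-level chunks instead of whole sentences,
which corresponds to the chunk scoring versus sentence scoring comparison of section 5.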

4.4 Redundancy Removal and Novelty Detection 

In the composition of the final summary, the step of redundancy control is fully integrated. 
As we will shortly describe, two alternatives were studied concerning the control of redundant 
information: One detects novelty of a new piece of information with respect to the current 
summary snapshot or to a user model; the other detects redundancy among the whole set of 
source information pieces, before the creation of the summary. These approaches just represent 
two different views of the same problem, which is the control of redundant information within a 
multi-document summary. 

The novelty detection process has two aspects, the intra-summary novelty and the inter- 
summary or user-modeled novelty. 

The intra-summary novelty refers to the novelty of a sentence in a summary, given the content 
of the summary at a specific time point. In order to ensure intra-summary novelty, one has to 
make sure that every sentence minimally repeats already existing information. To achieve this 
goal, we use the following process: 

1. Extract the n-gram graph representation of the summary so far, indicated as $G_{sum}$.

2. Keep the part of the summary representation that does not contain the common content
of the corresponding document set $U$: $G'_{sum} = G_{sum} \triangle C_U$.

3. For every candidate sentence in L that has not been already used:

(a) extract its n-gram graph representation, $G_{cs}$;

(b) keep only $G'_{cs} = G_{cs} \triangle C_U$, because we expect to judge redundancy for the part of the
n-gram graph that is not contained in the common content $C_U$;

(c) assign the similarity between $G'_{cs}$ and $G'_{sum}$ as the sentence redundancy score.

4. For all candidate sentences in L:

(a) set the score of the sentence to be its rank based on the similarity to $C_U$ minus the
rank based on the redundancy score.

5. Select the sentence with the highest score as the best option and add it to the summary.

6. Repeat the process until the word limit has been reached or no other sentences remain.
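
The steps above can be sketched as a greedy loop. The sketch makes several assumptions that are
not fixed by the description: graphs use character 3-grams with a window of 3, the operator that
removes the common content keeps the edges absent from $C_U$, ranks are positions in ascending
order (so a higher rank means more salient or more redundant), ties are broken by list order, and
selection stops at the first sentence that would exceed the word limit.

```python
from collections import Counter


def ngram_graph(text, n=3, window=3):
    """Character n-gram graph as {(gram, later_gram): co-occurrence count}."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, gram in enumerate(grams):
        for neighbour in grams[i + 1:i + 1 + window]:
            edges[(gram, neighbour)] += 1
    return dict(edges)


def value_similarity(g1, g2):
    """Assumed value similarity: weight ratios of common edges over the larger graph size."""
    if not g1 or not g2:
        return 0.0
    common = g1.keys() & g2.keys()
    total = sum(min(g1[e], g2[e]) / max(g1[e], g2[e]) for e in common)
    return total / max(len(g1), len(g2))


def without_content(graph, content):
    """Edges of `graph` that do not appear in the common content graph."""
    return {e: w for e, w in graph.items() if e not in content}


def ascending_ranks(values):
    """Rank 0 for the smallest value; ties keep their list order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order):
        ranks[index] = position
    return ranks


def select_sentences(candidates, content, word_limit):
    remaining, summary = list(candidates), []
    while remaining:
        # Steps 1-2: summary graph without the common content.
        g_sum = without_content(ngram_graph(" ".join(summary)), content)
        # Step 3: redundancy of every unused candidate against the summary so far.
        salience = [value_similarity(ngram_graph(s), content) for s in remaining]
        redundancy = [value_similarity(without_content(ngram_graph(s), content), g_sum)
                      for s in remaining]
        # Step 4: salience rank minus redundancy rank.
        s_rank, r_rank = ascending_ranks(salience), ascending_ranks(redundancy)
        best = max(range(len(remaining)), key=lambda i: s_rank[i] - r_rank[i])
        # Steps 5-6: add the best sentence unless the word limit would be exceeded.
        sentence = remaining.pop(best)
        if len(" ".join(summary + [sentence]).split()) > word_limit:
            break
        summary.append(sentence)
    return summary
```

On a toy candidate list containing a verbatim duplicate, the duplicate receives the highest
redundancy rank after its twin has been selected, so a different sentence is preferred next.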

The inter-summary or user-modeled novelty refers to the novelty of information apparent
when the summarization process takes into account information already available to the reader
(as per the TAC 2008 update task). This information can be contained in a user model, keeping
track of the most recent summaries provided to that user. In the TAC 2008 summarization task,
systems are supposed to take into account the first of two sets per topic, set A, as prior user
knowledge for the summary of set B of the same topic. In fact, set A contains documents
concerning a news item (e.g., Antarctic ice melting) that have been published before the documents
in set B. We have used the content of the given set A, $C_{U_A}$, in the redundancy removal process,
considering it to be the pre-existing user model. To do that, we always merged the representation
of set A with the representation of the current snapshot of the summary. In other words, the
content of set A appears to always be included in the current version of the summary and, thus,
new sentences avoid redundancy with respect to A.

Within this work we have also implemented a method of redundancy removal, as opposed to 
novelty detection, where redundancy is pinpointed within the original set of candidate sentences: 
We consider a sentence to be redundant, if it surpasses an empirically computed threshold — in 
the experiments this threshold had a value of 0.2'' — of overlap to any other candidate sentence. 
In each iteration within the redundancy removal process each sentence is compared only to 
sentences not already marked as redundant. As a result of this process, only the sentences that 
are not marked as redundant are used in the output summary. 
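
A minimal sketch of this threshold-based redundancy removal follows. It assumes that the overlap
between two sentences is measured by a value similarity of their character n-gram graphs (3-grams,
window of 3) and that candidates are visited in decreasing score order, so that of two overlapping
sentences the higher-ranked one is kept; neither choice is specified in the text, and the function
names are hypothetical.

```python
from collections import Counter


def ngram_graph(text, n=3, window=3):
    """Character n-gram graph as {(gram, later_gram): co-occurrence count}."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, gram in enumerate(grams):
        for neighbour in grams[i + 1:i + 1 + window]:
            edges[(gram, neighbour)] += 1
    return dict(edges)


def value_similarity(g1, g2):
    """Assumed overlap measure: weight ratios of common edges over the larger graph size."""
    if not g1 or not g2:
        return 0.0
    common = g1.keys() & g2.keys()
    total = sum(min(g1[e], g2[e]) / max(g1[e], g2[e]) for e in common)
    return total / max(len(g1), len(g2))


def remove_redundant(sentences, threshold=0.2):
    """Keep a sentence only if its overlap with every kept sentence is within the threshold."""
    kept = []
    for sentence in sentences:
        graph = ngram_graph(sentence)
        # Compare only against sentences that have not been marked as redundant.
        if all(value_similarity(graph, ngram_graph(k)) <= threshold for k in kept):
            kept.append(sentence)
    return kept


kept = remove_redundant(["ice melting in Antarctica",
                         "ice melting in Antarctica",
                         "stock markets rally"])
```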

Given the whole set of tools we described so far, we now provide some experimental results 
of applying variations of the aforementioned methodology and the conclusions reached. 

5 Experiments 

We conducted numerous experiments using the TAC 2008 corpus. We consider each variation of
our system based on a different parameter set to be a different system, with a different System ID.
Our main target is to see how our components affect the summarization process as a whole and
not to judge individual steps separately. Ideally, the summaries should be judged by humans,
but there are automatic methods [Lin, 2004, Hovy et al., 2005, Giannakopoulos et al., 2008b]
that correlate well to human judgment.

We have used AutoSummENG [Giannakopoulos et al., 2008b] as our system evaluation
method, since it consistently correlates well to the DUC and TAC manually-assigned
responsiveness measure. Responsiveness first appeared in the Document Understanding Conference
(DUC) of 2005. This extrinsic measure has been used in later DUCs as well. In DUC 2005, the
appointed task was the synthesis of a 250-word, well-organized answer to a complex question,
where the data of the answer would originate from multiple documents [Dang, 2005]. In DUC
2005, the question the summarizing 'peers', i.e., summarizer systems or humans, were supposed
to answer consisted of a topic identifier, a title, a narrative question and a granularity indication,
with values ranging from 'general' to 'specific'. The responsiveness score is an extrinsic measure
that was supposed to provide, as Dang states in [Dang, 2005], a 'coarse ranking of the summaries
for each topic, according to the amount of information in the summary that helps to satisfy the
information need expressed in the topic statement, at the level of granularity requested in the
user profile'.

In the 'Automatically Evaluating Summaries of Peers' (AESOP) task of TAC 2009, the
AutoSummENG method was shown to still be one of the top-performing methods in terms
of correlation to responsiveness [Dang, 2009, Giannakopoulos and Karkaletsis, 2009]. Thus, the
evaluation using the automatic AutoSummENG measure is meant to result in a partial ordering,
indicative of how well a given summary answers a given question, taking into account the
specifications of the question. In the following paragraph we define the task upon which MUDOS-NG
was evaluated using AutoSummENG.

In TAC 2008 there were two tasks. The main task was to produce a 100-word summary from
a set of 10 documents (Summary A). The update task was to produce a 100-word summary from
a set of 10 subsequent documents, with the assumption that the information in the first set is
already known to the reader (Summary B)*. There were 48 topics with 20 documents per topic
in chronological order. Each summary was to be extracted based on a topic description defined
by a title and a narrative query. For every topic 4 model summaries were provided for evaluation
purposes.

^The threshold should be computed via experiments or machine learning to relate with human estimated
redundancy of information, but this calculation has not been performed in this work.

*See http://www.nist.gov/tac/publications/2008/presentations/TAC2008_UPDATE_overview.pdf for an
overview of the Text Analysis Conference, Summarization Update Task of 2008.

System ID   CS   SS   RR   ND   QE   NE   Score
1           -    ✓    -    ✓    -    ✓    0.1202
2           -    ✓    ✓    -    -    ✓    0.1303*
3           ✓    -    ✓    -    ✓    -    0.1218
4           -    ✓    -    ✓    ✓    -    0.1198
5           -    ✓    ✓    -    ✓    -    0.1299
6           ✓    -    -    -    -    -    0.1255

Table 3: AutoSummENG summarization performance for different settings concerning scoring,
redundancy and query expansion. Legend: CS: Chunk Scoring, SS: Sentence Scoring, RR: Redundancy Removal,
ND: Novelty Detection, QE: Query Expansion, NE: No Expansion. Best performance marked with *.

At this point we indicate the pitfalls in using an overall evaluation measure like AutoSummENG,
ROUGE [Lin, 2004] or Basic Elements [Hovy et al., 2005] (also see [Belz, 2009] for a
related discussion):

• Small variations in system performance are not indicative of real performance change, due
to statistical error.

• The measure can say little about individual summaries, because its high correlation to
human judgment holds when judging a system as a whole.

• The measure cannot judge performance of intermediate steps, because it judges the output
summary only.

• The measure can only judge the summary with respect to the given model summaries.

Given the above restrictions, we have performed experiments to judge the change in performance
when using:

• chunk salience scoring versus sentence salience scoring.

• redundancy removal versus novelty detection.

• query expansion versus no query expansion.

In addition to the results of applying the different system configurations on the summarization
task, indicated in table 3, we performed an ANOVA (analysis of variance) test to determine
whether the System ID, i.e., the system configuration, is an important factor for the AutoSummENG
similarity of the peer text to the model texts. It was shown (with a p-value below 10^^^)
that:

• There are topics of various difficulty and the topic is an important factor for system
performance.

• Selection of different components for the summarizer, from the range of our proposed
components, can affect the summaries' quality. The finding was in fact that the SystemID
is an important factor of the performance.
System (TAC 2008 SysID)      AutoSummENG Score
Top Peer (43)                0.1991
Last Peer (18)               0.1029
Peer Average (All Peers)     0.1648 (Std. Dev. 0.0216)
Proposed System (-)          0.1303

Table 4: AutoSummENG performance data for TAC 2008. NOTE: The top and last peers are based on
the AutoSummENG measure performance of the systems.

The systems using chunk scoring show no statistically significant difference in performance
from the ones that use sentence scoring, as the (paired) t-test gave a p-value of 0.64. However, the
systems using chunk scoring, namely systems 3 and 6, had a slightly lower average performance
than the others. The systems using redundancy removal appear to differ in performance from the
ones that use novelty detection in a statistically significant way, nearly at the 0.05 significance
level (one-sided t-test). System 6 was chosen to not use any redundancy removal method and
performs near the average of all other systems; thus no conclusion can be drawn from it. Concerning
query expansion, it could not be shown that query expansion indeed offers improvement, as the
t-test gave a p-value of 0.74. This result is consistent with the 'slight improvement' indicated
in [Conroy et al., 2006], where the reverse mapping of a Porter stemmer was used to expand the
query with other versions of its words (e.g., adding 's' as verb suffix or noun suffix to terms in the
query). Similar, non-decisive results were found by [Blake et al., 2007], where query expansion
was determined to be of little use after experiments.

Table 4 shows the average performance of TAC 2008 participants over all topics.
More on the performance of TAC 2008 systems can be found in [Dang and Owczarzak,
Our system performs slightly below average but considerably better than the last successful participant.

This is very encouraging for the potential of the proposed summarization method, as it is 
based on generic algorithms performed on a generic representation, providing core operators for 
addressing the difficulties in each single summarization step. Moreover, the language neutrality 
of the method shows that it may provide a steady basis, which can be made more effective 
when combined with heuristics and machine learning methods exploiting language-dependent 
characteristics. 

To further examine the performance of our system on other corpora, we performed summarization
on the DUC 2006 corpus, using the configuration that performed optimally on the TAC 2008
corpus. Systems in DUC 2006 were to synthesize from a set of 25 documents a brief,
well-organized, fluent answer to a non-trivially expressed declaration of a need for information.
This means that the query could not be answered by just stating a name, date, quantity, or
similar singleton. The organizers of DUC 2006, NIST, also developed a simple baseline system
that returned all the leading sentences of the 'TEXT' field of the most recent document for each
topic, up to 250 words [Dang, 2006].

In Table 5 we illustrate the performance of our proposed system on the DUC 2006 corpus.
The system strongly outperforms the baseline system, and is less than a standard
deviation (0.0170) below the AutoSummENG mean performance (0.1842) of all the 35
participating systems.
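
The distance of the proposed system from the peer average can be expressed in standard deviations
for both corpora. The short computation below only re-uses the scores, means and standard deviations
reported for TAC 2008 and DUC 2006 in this section; the helper name is hypothetical.

```python
def std_units_below_mean(score, mean, std):
    """How many standard deviations `score` lies below the peer mean."""
    return (mean - score) / std


# TAC 2008: proposed system 0.1303, peer average 0.1648, standard deviation 0.0216.
tac_gap = std_units_below_mean(0.1303, 0.1648, 0.0216)
# DUC 2006: proposed system 0.1783, peer average 0.1842, standard deviation 0.0170.
duc_gap = std_units_below_mean(0.1783, 0.1842, 0.0170)
print(round(tac_gap, 2), round(duc_gap, 2))
```

The system lies roughly 1.6 standard deviations below the peer average on TAC 2008 and roughly 0.35
standard deviations below it on DUC 2006.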

System (DUC 2006 SysID)      AutoSummENG Score
Baseline (1)                 0.1437
Top Peer (23)                0.2050
Last Peer (11)               0.1260
Peer Average (All Peers)     0.1842 (Std. Dev. 0.0170)
Proposed System (-)          0.1783

Table 5: AutoSummENG performance data for DUC 2006. NOTE: The top and last peers are based on
the AutoSummENG measure performance of the systems.

From the comparison between the results on the DUC 2006 and the TAC 2008 tasks we can
conclude that our proposed system performed better in terms of responsiveness in the generic
summarization task of DUC 2006 than in the update task of TAC 2008. To identify the exact
defects of the TAC summaries is non-trivial and requires further investigation, across several
dimensions:

• What are the problems that affect the performance? Is it the content selection, the ordering
of sentences, the anaphora problems, the lack of coherence, or something else?

• Can the problem or error be quantified? If yes, how? If not, can a qualitative ranking of
quality be applied and, then, approximated by some evaluation methodology?

• How can we minimize the error or maximize the quality?

Nevertheless, it is very important that the proposed summarization components offered com- 
petitive results without using machine learning techniques combined with a rich set of sentence 
features^ like sentence position or grammatical properties. This indicates the usefulness of n-gram 
graphs as well as the generality of application of the n-gram graph operators and functions. How- 
ever, other components need to be added to reach state-of-the-art performance, given the existing 
means of evaluation. These components should aim to improve the overall coherence of the text 
and tackle problems of anaphora resolution (for examples of such problems see the summary in 
the appendix section A). 

6 Discussion and future work 

We have offered a generic method, based on the language-neutral representation and algorithms 
of n-gram graphs, aiming to tackle a number of automatic summarization problems: 

Salience detection We have indicated ways to determine the content of a cluster of documents 
and judge salience for a given sentence. 

Redundancy removal We have presented two different approaches following the CSIS and 
MMR paradigms. 

Query expansion We have proposed a scheme to broaden a given query, with a slightly im- 
proving effect over the summaries. The query expansion module is partially dependent on 
the language, in that it requires a thesaurus in the same language as the original query to 
perform expansion. 

From the alternatives we examined within the experiments, as far as responsiveness of the 
summaries is concerned, it stands that: 

• whole-sentence scoring should be preferred to chunk-based sentence scoring. 

• query expansion does not offer significant improvement, even though it does not appear to 
penalize the performance either. 

• redundancy removal performs better than novelty detection.
The newspaper International Herald Tribune reported on Friday that production problems

Figure 5: Sample summary for the D0801-A topic of TAC 2008 on the A380 airbus production
and launch news.

The experimental results presented judged only one aspect of our system; namely its re- 
sponsiveness. Based on these results we have seen that combining different methods for the 
components of the overall summarization method, one can achieve significantly different results. 
It is very important, however, that the proposed summarization components offered competi- 
tive results through simple application of n-gram graph theory and methodologies, without any 
optimization on specific corpora. 

As a side result, we have shown that there is a way to detect and remove noisy patterns from 
within n-gram graphs, using simple graph operators. Furthermore, we have illustrated that the 
removal of these patterns can improve the results of certain tasks. These tasks, like classification 
and topic detection, should be investigated through the n-gram graph representation prism to 
determine the potential of this representation as a generic NLP tool. 

It is obvious from the experiments that individual components are not easy to judge as parts 
of a summarization system. Focused and exhaustive performance evaluations should be carried 
out to identify the impact of each component to the overall performance. It might also be needed 
that components are examined outside the summarization context, as stand-alone methods for 
chunking, semantic annotation, redundancy detection, etc. 

In the future, we plan to test the effect of using various n-gram ranks within different parts
of the summarization process. We have so far intuitively concluded that n-grams of lower ranks
express the grammar model of a given language, i.e., the set of allowed sequences of
characters, while higher rank n-grams cross over the word boundaries and offer topic information.
In [Giannakopoulos et al., 2008b] there exists a methodology for the detection of statistically-
determined important substrings of a text, called symbols within the context of that work. The
use of these symbols, only, within n-gram graphs may alleviate the insertion of noise within the
different summarization steps and diminish the computational cost of the method. Furthermore,
emergent subgraphs and paths within a document set graph may allow for the extraction of
non-obvious relations between text snippets as well as the detection of discourse phenomena and
subtopics within a document set. These phenomena and subtopics can then be used to improve
the structure and sentence ordering of the summary, as it has been shown that sentence ordering
has an important effect on summarization [Dang, 2009, Barzilay et al., 2002].

Last, but definitely not least, we need to evaluate the summaries extracted in any of the given
corpora under the view of additional textual qualities, i.e., regardless of any responsiveness-
related score. We should identify in what way the individual extracted summaries (see Figure
5 for a sample summary and Appendix A) are worse than gold-standard summaries. Only then
will we be able to improve on our promising current work.

We note that the whole MUDOS-NG source code is available, and under constant revision
and improvement, as part of the JINSECT open source project^ to facilitate the study of the
methods we have presented.

^See http://sourceforge.net/projects/jinsect/ for more. 
Over all, the rate of substance abuse among Native American adults is over 20 percent nationwide.

Figure 6: Sample summary for the D0601A topic of DUC 2006

A Sample Summary and Scoring

Here we provide a sample DUC 2006 corpus topic (Example A.1) and the MUDOS-NG extracted
summary in Figure 6.

Example A.1 Topic Definition:
<topic> 

<num> D0601A </num> 

<title> Native American Reservation System - pros and cons </title> 
<narr> 

Discuss conditions on American Indian
reservations or among
Native American communities.

Include the benefits and drawbacks of the reservation system.

Include legal privileges and problems.

</narr> 

</topic> 

In table 6 we indicate the top 10 candidate sentences from the document set used to extract
the summary. The sentences appear in decreasing order of their score. It is important to note that
some of the sentences were removed by the redundancy removal step and therefore do not appear
in the final summary.
Value      Sentence

0.331996   The income disparity on the reservation has widened.

0.300312   Many reservation residents are frustrated.

0.276393   Over all, the rate of substance abuse among Native American adults is over 20 percent
           nationwide.

0.269523   But not everyone on the reservation is happy about the growth.

0.256380   A retail shopping center is proposed for a nearby section of the reservation.

0.255839   And the migration from the reservations continues. "Mostly, it's economics," says Joanne
           Dunne, a spokeswoman for the Boston Indian Council, a nonprofit cultural group. "The
           reservation typically doesn't provide you with any real opportunities." Also, many Native
           Americans travel between the reservation and urban areas. "Some tribes traditionally go
           from one place to another," says Dunne. "The Micmacs always crossed the border from
           Canada to come from time immemorial." In other cities, city and state governments have
           created agencies to specifically deal with the Native American population.

0.252962   As a result, some urban Native Americans feel driven away.

0.235350   From 1980 to 2000, the urban Native American population has more than doubled.

0.235100   But a non-Indian resident of a reservation has no say in tribal government.

0.216477   Smith and thousands like her are seeking help for their substance abuse at the American
           Indian Community House, the largest of a handful of Native American cultural institutions
           in the New York area. "Where else can I go for help?" she asks. "Any place else, they
           don't understand you like they do here." Native Americans around the country are leaving
           reservations and relocating in urban areas at a dizzying rate.

Table 6: The top 10 sentences provided by MUDOS-NG and their scores (sentence scoring).